[
https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431941#comment-13431941
]
Eli Reisman commented on GIRAPH-246:
------------------------------------
The version of Hadoop we test on is not without its own patches. I agree with
both of you. This whole thing makes me uncomfortable. I would be most
comfortable with a fix whose workings we all understood.
It was shortly after the discussion I had with Claudio and Maya about the disk
spill code that I first got the impression that not everyone was attempting to
scale this out, perhaps assuming it would scale because we ride on Hadoop, BSP
has "Bulk" in it, etc. After that discussion, I realized that others might not
be seeing the problems that were bothering me because we weren't all testing on
the same scale of hardware. Since then, as this issue has come up, I have
assumed that no one but me was verifying this behavior.
I am not comfortable being the single source of ground truth on this,
especially if others feel strongly about it. All along, I assumed there would
come a point where someone would stop arguing and just run some large jobs on
some large data, so that at least someone else could verify whether one of the
other fixes could be made to work.
Right now (and for the immediate future) none of us are able to schedule large
jobs and do this (I have been trying, believe me). I do have a rebased patch
that works for us; we have been using it for a while now with success. If you
want to wait and put this in later when others encounter the same problem, or
if someone wants to be proactive about fixing the problem soon, I'm totally
fine with that.
My only horse in this race is getting jobs to run, and in the short term I have
a way to do that with 246-11 or the other rebase. So, barring some alternate
ground truth coming into the picture and settling this, I'm fine with handling
it however we like, as long as it happens within a short-term window.
> Periodic worker calls to context.progress() will prevent timeout on some
> Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
> Key: GIRAPH-246
> URL: https://issues.apache.org/jira/browse/GIRAPH-246
> Project: Giraph
> Issue Type: Improvement
> Components: bsp
> Affects Versions: 0.2.0
> Reporter: Eli Reisman
> Assignee: Eli Reisman
> Priority: Minor
> Labels: hadoop, patch
> Fix For: 0.2.0
>
> Attachments: GIRAPH-246-1.patch, GIRAPH-246-10.patch,
> GIRAPH-246-11.patch, GIRAPH-246-2.patch, GIRAPH-246-3.patch,
> GIRAPH-246-4.patch, GIRAPH-246-5.patch, GIRAPH-246-6.patch,
> GIRAPH-246-7.patch, GIRAPH-246-8.patch, GIRAPH-246-9.patch
>
>
> This simple change adds a command-line configurable option to GiraphJob that
> controls the time between calls to context.progress(), allowing workers to
> avoid timeouts during long data load-ins in which some workers complete their
> input split reads much faster than others, or finish a superstep faster. I
> found this allowed large-scale jobs with low memory overhead to complete even
> when they would previously time out during runs on a Hadoop cluster. Timeouts
> are still possible when a worker crashes, runs out of memory, or has other
> legitimate GC or RPC trouble, but this change prevents unintentional timeouts
> when the worker is actually still healthy.
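
For context, here is a minimal sketch of the general technique the description
refers to: a daemon thread that pings a Hadoop Progressable at a configurable
interval so the task is not killed by the cluster's task timeout while it waits
at a barrier. The class name, constructor, and interval parameter below are
illustrative assumptions, not the code in the attached patches; the actual
patch wires the interval through a GiraphJob command-line option.

    import org.apache.hadoop.util.Progressable;

    // Illustrative sketch only; names and structure are assumptions,
    // not the GIRAPH-246 patch itself.
    public class PeriodicProgressReporter implements Runnable {
      private final Progressable context;  // e.g. the worker's Mapper.Context
      private final long intervalMs;       // configurable time between progress() calls
      private volatile boolean running = true;

      public PeriodicProgressReporter(Progressable context, long intervalMs) {
        this.context = context;
        this.intervalMs = intervalMs;
      }

      @Override
      public void run() {
        while (running) {
          context.progress();  // tell the TaskTracker this worker is still alive
          try {
            Thread.sleep(intervalMs);
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
          }
        }
      }

      public void stop() {
        running = false;
      }
    }

A worker would start something like this on a daemon thread before entering a
long barrier wait or input split load and stop it once past that point, so that
legitimate hangs (crashes, OOM, GC or RPC trouble) can still be detected.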
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira