[ https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431941#comment-13431941 ]

Eli Reisman commented on GIRAPH-246:
------------------------------------

The version of Hadoop we test on is not without its own patches. I agree with 
both of you. This whole thing makes me uncomfortable. I would be most 
comfortable with a fix whose workings we actually understand.

The discussion I had with Claudio and Maya about the disk spill code was the 
first time I had the impression that not everyone was attempting to scale this 
out, on the assumption that we ride on Hadoop, that BSP has "Bulk" in the name, 
etc. After that discussion, I realized others were perhaps not seeing the 
problems that were bothering me because we weren't all testing on the same 
scale of hardware. Since then, as this issue has come up, I have assumed no one 
was verifying this stuff but me.

I am not comfortable being the single source of ground truth on this, 
especially if others feel strongly about it. All along, I assumed there would 
come a point where someone would stop arguing, run some large jobs on some 
large data, and verify whether anyone else could make one of the other fixes 
work, or not.

Right now (and for the immediate future) none of us is able to schedule large 
jobs and do this (I have been trying, believe me). I do have a rebased patch 
that works for us; we have been using it for a while now with success. If you 
want to wait and put this in later when others encounter the same problem, or 
if someone is proactively fixing the problem soon, I'm totally fine with that.

My only horse in this race is getting jobs to run. In the short term, I have a 
way to do that with 246-11 or the other rebase. So barring some alternate 
ground truth coming into the picture and settling this, I'm fine handling this 
however we like, as long as it's a short-term window.

                
> Periodic worker calls to context.progress() will prevent timeout on some 
> Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-246
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-246
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: hadoop, patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-246-1.patch, GIRAPH-246-10.patch, 
> GIRAPH-246-11.patch, GIRAPH-246-2.patch, GIRAPH-246-3.patch, 
> GIRAPH-246-4.patch, GIRAPH-246-5.patch, GIRAPH-246-6.patch, 
> GIRAPH-246-7.patch, GIRAPH-246-8.patch, GIRAPH-246-9.patch
>
>
> This simple change adds a command-line configurable option to GiraphJob that 
> controls the time between calls to context().progress(), allowing workers to 
> avoid timeouts during long data load-ins in which some workers complete their 
> input split reads much faster than others, or finish a superstep faster. I 
> found this allowed large-scale jobs with low memory overhead to complete even 
> when they would previously time out during runs on a Hadoop cluster. Timeout 
> is still possible when the worker crashes, runs out of memory, or has other 
> legitimate GC or RPC trouble, but this prevents unintentional failures when 
> the worker is actually still healthy.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
