[ 
https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435694#comment-13435694
 ] 

Jaeho Shin commented on GIRAPH-246:
-----------------------------------

I believe the source of the timeout with trunk in that case is the missing 
progress() call around the {getInputSplitsStateChangedEvent().waitMsecs(60 * 
1000);} line in BspServiceWorker.  We saw the same problem of idle workers 
timing out during INPUT_SUPERSTEP and solved it with GIRAPH-274, which also 
has the progress() call at the same spot as your new fix.  Other important 
progress() calls we added were in saveVertices().  I remember we saw some 
timeouts during CLEANUPs.  So, Eli, how about including the other progress() 
calls from GIRAPH-274 as well, and leaving that one open for an alternative 
approach to this timeout problem (e.g., an optional progress thread)?
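
To illustrate the pattern I mean, here is a minimal, self-contained sketch of 
a timed wait that reports progress between slices.  It is not the actual 
BspServiceWorker code: ProgressingWait, ProgressReporter, and 
waitWithProgress are hypothetical names, and a CountDownLatch stands in for 
the event returned by getInputSplitsStateChangedEvent().

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class ProgressingWait {
    /** Hypothetical stand-in for the Hadoop context's progress() call. */
    public interface ProgressReporter {
        void progress();
    }

    /**
     * Waits until the event is signaled, calling progress() after every
     * slice that times out, so an idle but healthy worker keeps reporting.
     *
     * @return the number of slices that timed out before the event fired
     */
    public static int waitWithProgress(CountDownLatch event,
                                       long sliceMsecs,
                                       ProgressReporter reporter)
            throws InterruptedException {
        int timedOutSlices = 0;
        // await() returns false when the slice elapses without a signal.
        while (!event.await(sliceMsecs, TimeUnit.MILLISECONDS)) {
            reporter.progress();
            timedOutSlices++;
        }
        return timedOutSlices;
    }
}
```

The slice length would need to stay well below the cluster's task timeout; 
the 60 * 1000 value in the line quoted above is one such choice.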

It seems we are all now happy with the waitForever()s, and PredicateLock has 
become even safer over the course of our long discussion.
                
> Periodic worker calls to context.progress() will prevent timeout on some 
> Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-246
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-246
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: hadoop, patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-246-10.patch, GIRAPH-246-11.patch, 
> GIRAPH-246-1.patch, GIRAPH-246-2.patch, GIRAPH-246-3.patch, 
> GIRAPH-246-4.patch, GIRAPH-246-5.patch, GIRAPH-246-6.patch, 
> GIRAPH-246-7.patch, GIRAPH-246-7_rebase1.patch, GIRAPH-246-8.patch, 
> GIRAPH-246-9.patch, GIRAPH-246-NEW-FIX.patch
>
>
> This simple change creates a command-line configurable option in GiraphJob to 
> control the time between calls to context.progress(), which allows workers to 
> avoid timeouts during long data load-ins in which some workers complete their 
> input split reads much faster than others, or finish a superstep faster. I 
> found this allowed jobs that were large-scale but with low memory overhead to 
> complete even when they would previously time out during runs on a Hadoop 
> cluster. A timeout is still possible when the worker crashes, runs out of 
> memory, or has other legitimate GC or RPC trouble, but the change prevents 
> unintended timeouts when the worker is actually still healthy.
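
For reference, the "optional progress thread" alternative mentioned in the 
comment, driven by a configurable interval like the one the issue summary 
describes, could look roughly like the sketch below.  This is an assumption 
about one possible design, not code from any attached patch: ProgressThread 
and ProgressReporter are hypothetical names, and the interval would come from 
whatever job option is chosen.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ProgressThread implements AutoCloseable {
    /** Hypothetical stand-in for the Hadoop context's progress() call. */
    public interface ProgressReporter {
        void progress();
    }

    private final ScheduledExecutorService scheduler;

    /** Starts a daemon thread that calls progress() every intervalMsecs. */
    public ProgressThread(ProgressReporter reporter, long intervalMsecs) {
        scheduler = Executors.newSingleThreadScheduledExecutor(runnable -> {
            Thread thread = new Thread(runnable, "progress-thread");
            // Daemon, so it never keeps the worker JVM alive on its own.
            thread.setDaemon(true);
            return thread;
        });
        scheduler.scheduleAtFixedRate(reporter::progress,
            intervalMsecs, intervalMsecs, TimeUnit.MILLISECONDS);
    }

    /** Stops reporting, e.g. once the barrier wait has finished. */
    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

One trade-off to keep in mind: a background thread keeps reporting even if 
the main worker thread is stuck, so it would likely need to be opt-in and 
limited to known-idle phases such as barrier waits.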


        
