[ https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430802#comment-13430802 ]

Avery Ching commented on GIRAPH-246:
------------------------------------

I appreciate your testing, Eli; glad to hear that it's working well.

There are only 2 things that I would like to see:

1) The code block I wrote about above

{code}
@@ -286,6 +288,7 @@ public class BspServiceWorker<I extends WritableComparable,
       // an InputSplit has finished.
       getInputSplitsStateChangedEvent().waitMsecs(60 * 1000);
       getInputSplitsStateChangedEvent().reset();
+      getContext().progress();
{code}

2) Can't we just keep PredicateLock using Context, without reverting it?  It's 
not hurting anything, and it's a very small piece of code that should support 
waitForever() correctly.  The old way of doing waitForever() is wrong unless 
context.progress() gets called internally.
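
To illustrate point 2, here is a minimal sketch of a waitForever() that reports 
progress internally by waiting in bounded slices. The names ProgressReporter 
and ProgressingEvent and the configurable interval are assumptions made for 
this example; this is not the actual PredicateLock code, and ProgressReporter 
only stands in for the Hadoop context's progress() call.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for the Hadoop context's progress() call. */
interface ProgressReporter {
  void progress();
}

/** Sketch of an event whose waitForever() reports progress while blocked. */
class ProgressingEvent {
  private final Lock lock = new ReentrantLock();
  private final Condition cond = lock.newCondition();
  private final ProgressReporter reporter;
  private final long intervalMsecs; // assumed interval between progress calls
  private boolean eventOccurred = false;

  ProgressingEvent(ProgressReporter reporter, long intervalMsecs) {
    this.reporter = reporter;
    this.intervalMsecs = intervalMsecs;
  }

  /** Mark the event as having occurred and wake all waiters. */
  void signal() {
    lock.lock();
    try {
      eventOccurred = true;
      cond.signalAll();
    } finally {
      lock.unlock();
    }
  }

  /** Clear the event so it can be waited on again. */
  void reset() {
    lock.lock();
    try {
      eventOccurred = false;
    } finally {
      lock.unlock();
    }
  }

  /** Wait up to msecs; returns true if the event occurred, false on timeout. */
  boolean waitMsecs(long msecs) throws InterruptedException {
    lock.lock();
    try {
      long nanosLeft = TimeUnit.MILLISECONDS.toNanos(msecs);
      while (!eventOccurred) {
        if (nanosLeft <= 0) {
          return false;
        }
        nanosLeft = cond.awaitNanos(nanosLeft);
      }
      return true;
    } finally {
      lock.unlock();
    }
  }

  /** Block until the event occurs, reporting progress after each slice. */
  void waitForever() throws InterruptedException {
    while (!waitMsecs(intervalMsecs)) {
      reporter.progress();
    }
  }
}
```

With this shape, callers of waitForever() do not need to remember to call 
progress() themselves, and the interval only has to be shorter than the task 
timeout configured on the cluster.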

I'm totally +1'ed on everything else and I'd be happy to help review this once 
we resolve the above.
                
> Periodic worker calls to context.progress() will prevent timeout on some 
> Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-246
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-246
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: hadoop, patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-246-1.patch, GIRAPH-246-2.patch, 
> GIRAPH-246-3.patch, GIRAPH-246-4.patch, GIRAPH-246-5.patch, 
> GIRAPH-246-6.patch, GIRAPH-246-7.patch
>
>
> This simple change adds a command-line configurable option in GiraphJob to 
> control the time between calls to context.progress(). It allows workers to 
> avoid timeouts during long data load-ins in which some workers complete their 
> input split reads much faster than others, or finish a superstep faster. I 
> found this allowed jobs that were large-scale but with low memory overhead to 
> complete even when they would previously time out during runs on a Hadoop 
> cluster. A timeout is still possible when a worker crashes, runs out of 
> memory, or has other legitimate GC or RPC trouble, but this change prevents 
> unintended timeouts when the worker is actually still healthy.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
