[ https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431514#comment-13431514 ]

Avery Ching commented on GIRAPH-246:
------------------------------------

I have to admit that I feel a little uncomfortable with this, because there 
is no reason that waitForever() calling progress() shouldn't do the right 
thing.  It would be more efficient to rely on waitForever() rather than 
iterating the loops, because those loops will hit ZooKeeper in several cases 
(putting additional load on the system in general).  That being said, I 
appreciate Eli's large-scale testing and actual experience.  So if we do 
commit this, it will be because we trust Eli's testing over the actual code; 
as far as I know, there is no technical reason why this change should ensure 
progress better than what exists. 
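
To make the two approaches concrete, here is a minimal, self-contained sketch 
of the looping style under discussion: a timed wait that reports progress 
each time it times out.  Everything in it (ProgressWaitSketch, SimpleEvent, 
waitWithProgress, and the Runnable standing in for context.progress()) is a 
hypothetical stand-in written for illustration; it is not Giraph's actual 
event class or worker code, and it leaves out the ZooKeeper checks that the 
real loops perform on each pass.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration only; not Giraph's event or worker classes. */
class ProgressWaitSketch {

  /** Minimal stand-in for an event that supports a timed wait. */
  static class SimpleEvent {
    private boolean signaled = false;

    synchronized void signal() {
      signaled = true;
      notifyAll();
    }

    /** Waits up to msecs; returns true if the event was signaled in time. */
    synchronized boolean waitMsecs(long msecs) throws InterruptedException {
      long deadline = System.currentTimeMillis() + msecs;
      while (!signaled) {
        long remaining = deadline - System.currentTimeMillis();
        if (remaining <= 0) {
          return false;  // timed out without a signal
        }
        wait(remaining);
      }
      return true;
    }
  }

  /**
   * Blocks until the event is signaled, running the progress callback after
   * every timed-out wait.  Returns how many times the callback ran.
   */
  static int waitWithProgress(SimpleEvent event, long awaitProgressMsecs,
      Runnable progress) throws InterruptedException {
    int calls = 0;
    while (!event.waitMsecs(awaitProgressMsecs)) {
      progress.run();  // stands in for context.progress()
      calls++;
    }
    return calls;
  }

  public static void main(String[] args) throws Exception {
    SimpleEvent event = new SimpleEvent();
    AtomicInteger reports = new AtomicInteger();
    Thread signaler = new Thread(() -> {
      try {
        Thread.sleep(250);  // simulate a slow peer reaching the barrier
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      event.signal();
    });
    signaler.start();
    int calls = waitWithProgress(event, 50, reports::incrementAndGet);
    signaler.join();
    System.out.println("progress calls while waiting: " + calls);
  }
}
```

In this sketch the only cost of each pass is the callback itself; the extra 
ZooKeeper load mentioned above would come from whatever else the real loop 
body does between timed waits.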

I also have a minor comment.

Can we avoid breaking the line before the '.'?

{code}
+        getPartitionAssignmentsReadyChangedEvent()
+          .waitMsecs(awaitProgressMsecs);
{code}
                
> Periodic worker calls to context.progress() will prevent timeout on some 
> Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-246
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-246
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: hadoop, patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-246-1.patch, GIRAPH-246-10.patch, 
> GIRAPH-246-11.patch, GIRAPH-246-2.patch, GIRAPH-246-3.patch, 
> GIRAPH-246-4.patch, GIRAPH-246-5.patch, GIRAPH-246-6.patch, 
> GIRAPH-246-7.patch, GIRAPH-246-8.patch, GIRAPH-246-9.patch
>
>
> This simple change creates a command-line configurable option in GiraphJob to 
> control the time between calls to context().progress(). It allows workers to 
> avoid timeouts during long data load-ins in which some workers complete their 
> input split reads much faster than others, or finish a superstep faster. I 
> found this allowed jobs that were large-scale but with low memory overhead to 
> complete even when they would previously time out during runs on a Hadoop 
> cluster. Timeout is still possible when the worker crashes, runs out of 
> memory, or has other legitimate GC or RPC trouble, but the change prevents 
> unintentional crashes when the worker is actually still healthy.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
