[
https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13414472#comment-13414472
]
Eli Reisman commented on GIRAPH-246:
------------------------------------
Alessandro, thanks for your thoughtful input.
Before pruning calls to progress() anywhere, I would urge looking into
the much larger issue: any worker failure in Giraph, for any reason (even a
healthy worker that simply didn't call progress() often enough), causes a
cascading failure of the whole job. This means the longest-running (most
overloaded) worker in any job can cause the whole job to fail.
This JIRA is narrowly focused and does not attempt to address those larger
issues. It just makes minimal changes to functionally solve the problem
for now, in the same way it is solved in other parts of the codebase.
So, with regard to GIRAPH-76: great idea, but too comprehensive for what I'm
trying to do, as that solution extends into many other areas. Thanks for the
idea though!
> Periodic worker calls to context.progress() will prevent timeout on some
> Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
> Key: GIRAPH-246
> URL: https://issues.apache.org/jira/browse/GIRAPH-246
> Project: Giraph
> Issue Type: Improvement
> Components: bsp
> Affects Versions: 0.2.0
> Reporter: Eli Reisman
> Assignee: Eli Reisman
> Priority: Minor
> Labels: hadoop, patch
> Fix For: 0.2.0
>
> Attachments: GIRAPH-246-1.patch, GIRAPH-246-2.patch,
> GIRAPH-246-3.patch
>
>
> This simple change adds a command-line configurable option to GiraphJob that
> controls the time between calls to context.progress(). This allows workers to
> avoid timeouts during long data load-ins in which some workers complete their
> input split reads much faster than others, or finish a superstep faster. I
> found this allowed jobs that were large-scale but with low memory overhead to
> complete even when they would previously time out during runs on a Hadoop
> cluster. A timeout is still possible when a worker crashes, runs out of
> memory, or has other legitimate GC or RPC trouble, but the change prevents
> unintentional job failures when the worker is actually still healthy.
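
For readers who have not opened the patch, the idea described above can be
sketched roughly as follows. This is not the code from the GIRAPH-246 patches:
the class BarrierProgressSketch, the ProgressReporter interface (a stand-in
for the Hadoop task context), the CountDownLatch used as a barrier, and the
intervalMs parameter are all hypothetical names chosen for illustration. The
only point is that a worker waiting at a barrier wakes up at a configurable
interval and reports progress instead of blocking silently.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BarrierProgressSketch {
    /** Hypothetical stand-in for the Hadoop task context; only progress() matters here. */
    public interface ProgressReporter {
        void progress();
    }

    /**
     * Waits until the barrier opens, waking every intervalMs milliseconds to
     * call progress(). Returns the number of progress() calls made.
     */
    public static int waitWithProgress(CountDownLatch barrier,
                                       ProgressReporter context,
                                       long intervalMs)
            throws InterruptedException {
        int calls = 0;
        // await() returns false on timeout and true once the latch reaches zero.
        while (!barrier.await(intervalMs, TimeUnit.MILLISECONDS)) {
            context.progress();
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch barrier = new CountDownLatch(1);
        AtomicInteger heartbeats = new AtomicInteger();

        // Simulates a slower worker that reaches the barrier late.
        Thread slowWorker = new Thread(() -> {
            try {
                Thread.sleep(300);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            barrier.countDown();
        });
        slowWorker.start();

        int calls = waitWithProgress(barrier, heartbeats::incrementAndGet, 50);
        slowWorker.join();
        System.out.println("progress() calls while waiting: " + calls);
    }
}
```

Note that a loop like this only helps a healthy worker that is merely waiting.
A worker that has crashed or is stuck in GC never reaches the progress() call,
so it can still time out, as the description says.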