[
https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430811#comment-13430811
]
Eli Reisman commented on GIRAPH-246:
------------------------------------
Sure, if that's what you want. I am not wedded to this solution; I am just
nervous about missing testing windows while we figure out why the lock
fix isn't working a second time. I will attempt to repair that solution and try
it out. Failing that, I can at least leave the context object available and
eliminate the Progressable interface, which apparently is not actually calling
Context. I'll have it up tonight or tomorrow.
BTW, I realize we are pushing for a release. I am not sure of your window, but
I am letting all the patches that have built up get committed, and I can then
swoop in with 214 (the GiraphConf) as the dust settles. I haven't forgotten
about it!
> Periodic worker calls to context.progress() will prevent timeout on some
> Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
> Key: GIRAPH-246
> URL: https://issues.apache.org/jira/browse/GIRAPH-246
> Project: Giraph
> Issue Type: Improvement
> Components: bsp
> Affects Versions: 0.2.0
> Reporter: Eli Reisman
> Assignee: Eli Reisman
> Priority: Minor
> Labels: hadoop, patch
> Fix For: 0.2.0
>
> Attachments: GIRAPH-246-1.patch, GIRAPH-246-2.patch,
> GIRAPH-246-3.patch, GIRAPH-246-4.patch, GIRAPH-246-5.patch,
> GIRAPH-246-6.patch, GIRAPH-246-7.patch
>
>
> This simple change adds a command-line configurable option to GiraphJob that
> controls the time between calls to context.progress(). This lets workers
> avoid timeouts during long data load-ins in which some workers complete their
> input split reads much faster than others, or finish a superstep faster. I
> found this allowed jobs that were large-scale but had low memory overhead to
> complete even when they would previously time out during runs on a Hadoop
> cluster. A timeout is still possible when a worker crashes, runs out of
> memory, or has other legitimate GC or RPC trouble, but the change prevents
> unintentional crashes when the worker is actually still healthy.
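
For readers unfamiliar with the idea, here is a rough sketch of a barrier wait
that reports progress at a fixed interval. It is not code from any of the
attached patches. The class and method names are hypothetical, a
CountDownLatch stands in for the real barrier, and a plain Runnable stands in
for the call to context.progress().

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration; not taken from the GIRAPH-246 patches. */
final class BarrierProgressSketch {

    /**
     * Blocks until the barrier opens, invoking reportProgress each time
     * intervalMs elapses without the barrier opening.
     *
     * @return the number of progress reports made while waiting
     */
    static int awaitWithProgress(CountDownLatch barrier,
                                 Runnable reportProgress,
                                 long intervalMs) throws InterruptedException {
        int reports = 0;
        // await(timeout) returns false if the timeout elapsed before the
        // latch reached zero, so each false result triggers one report.
        while (!barrier.await(intervalMs, TimeUnit.MILLISECONDS)) {
            reportProgress.run(); // stand-in for context.progress()
            reports++;
        }
        return reports;
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch barrier = new CountDownLatch(1);
        // Simulate a slower worker that reaches the barrier after about 300 ms.
        Thread slowWorker = new Thread(() -> {
            try {
                Thread.sleep(300);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            barrier.countDown();
        });
        slowWorker.start();
        int reports = awaitWithProgress(
                barrier, () -> System.out.println("progress reported"), 50);
        slowWorker.join();
        System.out.println("barrier released after " + reports + " report(s)");
    }
}
```

The interval passed in plays the role of the configurable option described
above. It would need to be shorter than the cluster's task timeout for the
reports to have any effect.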