[
https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436077#comment-13436077
]
Eli Reisman commented on GIRAPH-246:
------------------------------------
Actually, the calls that made the difference for us here are not in 274, as it
does not modify the PredicateLock. This "246 new fix" patch adds progress calls
to waitMsecs() as well, so that regardless of the timeout value it is called
with, progress is reported occasionally (all condition.await() calls are now
timed; see the patch).
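Roughly, the shape of the change is something like the sketch below. This is
only an illustration, not the actual PredicateLock code from the patch: the
class name, the 15-second interval, and the use of Hadoop's Progressable are
all assumptions made for the example.

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

import org.apache.hadoop.util.Progressable;

/** Illustrative lock that reports progress while waiting (not the real PredicateLock). */
public class ProgressingLock {
  /** Illustrative interval between progress reports while waiting. */
  private static final long PROGRESS_INTERVAL_MSECS = 15 * 1000;

  private final Lock lock = new ReentrantLock();
  private final Condition cond = lock.newCondition();
  private final Progressable progressable;   // e.g. the mapper's context
  private boolean eventOccurred = false;

  public ProgressingLock(Progressable progressable) {
    this.progressable = progressable;
  }

  /** Wake up anyone blocked in waitMsecs(). */
  public void signal() {
    lock.lock();
    try {
      eventOccurred = true;
      cond.signalAll();
    } finally {
      lock.unlock();
    }
  }

  /**
   * Wait up to msecs for the event, but only ever block in short, timed
   * await() calls so progress can be reported between them.
   */
  public boolean waitMsecs(long msecs) {
    long deadline = System.currentTimeMillis() + msecs;
    lock.lock();
    try {
      while (!eventOccurred) {
        long remaining = deadline - System.currentTimeMillis();
        if (remaining <= 0) {
          return false;   // timed out without the event occurring
        }
        // Timed await instead of an untimed await()...
        cond.await(Math.min(remaining, PROGRESS_INTERVAL_MSECS),
            TimeUnit.MILLISECONDS);
        // ...so the task can tell Hadoop it is still alive.
        progressable.progress();
      }
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      lock.unlock();
    }
  }
}
{code}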
I have agreed all along about the (60 * 1000) call; that is why I changed
waitMsecs in this patch and why I have suspected it for a while now. The
problem is that, unlike some other barrier events, we cannot just refactor
that into a waitForever, because we do need workers to sleep for a fair amount
of time and then wake up again to try to read more splits. So the progress
calls either had to go in the lock or in that routine itself, as in the
original 246 patch.
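For context, the split-reading side then looks roughly like the hypothetical
sketch below; none of these helper names come from the real code, it only
illustrates why the wait cannot simply be a waitForever.

{code:java}
/** Hypothetical sketch of a worker loop that retries reading input splits. */
public class SplitReaderSketch {
  private final ProgressingLock splitsAddedEvent;   // lock from the sketch above

  public SplitReaderSketch(ProgressingLock splitsAddedEvent) {
    this.splitsAddedEvent = splitsAddedEvent;
  }

  /** Keep claiming and reading splits, sleeping a bounded time when none are free. */
  public void loadSplits() {
    while (!allSplitsDone()) {
      String splitPath = tryReserveSplit();
      if (splitPath != null) {
        readSplit(splitPath);
      } else {
        // Nothing to claim right now: sleep a bounded interval (using the
        // timed, progress-reporting wait above), then re-check for splits.
        splitsAddedEvent.waitMsecs(60 * 1000);
      }
    }
  }

  // Stubs standing in for the real ZooKeeper-backed bookkeeping.
  private boolean allSplitsDone() { return true; }
  private String tryReserveSplit() { return null; }
  private void readSplit(String splitPath) { }
}
{code}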
Looking at 274 again, I see why it didn't fix this for us: we have not had a
problem with timeouts in those areas of the code (the cleanup phase) yet. If
you have, please rebase that patch (274) and let's commit them both so that all
the affected areas are protected.
I think that, given your different hardware configuration and data loads, we
are both in the same position: each of us has been hitting progress issues in
different sections of a job run that were not being seen (or had not yet become
an issue) by the other parties running Giraph.
Anyway, I think the two patches will probably work well together, since
between them they appear to cover all the sections of code where timeouts have
been observed so far.
If anyone sees anything they want changed in this, please let me know ASAP;
I'd like to get some form of working fix committed quickly so I can stop
rebasing the old fix for folks here and keep taking advantage of any spare
moments on our cluster to test other code.
> Periodic worker calls to context.progress() will prevent timeout on some
> Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
> Key: GIRAPH-246
> URL: https://issues.apache.org/jira/browse/GIRAPH-246
> Project: Giraph
> Issue Type: Improvement
> Components: bsp
> Affects Versions: 0.2.0
> Reporter: Eli Reisman
> Assignee: Eli Reisman
> Priority: Minor
> Labels: hadoop, patch
> Fix For: 0.2.0
>
> Attachments: GIRAPH-246-10.patch, GIRAPH-246-11.patch,
> GIRAPH-246-1.patch, GIRAPH-246-2.patch, GIRAPH-246-3.patch,
> GIRAPH-246-4.patch, GIRAPH-246-5.patch, GIRAPH-246-6.patch,
> GIRAPH-246-7.patch, GIRAPH-246-7_rebase1.patch, GIRAPH-246-8.patch,
> GIRAPH-246-9.patch, GIRAPH-246-NEW-FIX.patch
>
>
> This simple change creates a command-line configurable option in GiraphJob to
> control the time between calls to context().progress(), allowing workers to
> avoid timeouts during long data load-ins in which some workers complete their
> input split reads much faster than others, or finish a superstep faster. I
> found this allowed jobs that were large-scale but had low memory overhead to
> complete even when they would previously time out during runs on a Hadoop
> cluster. A timeout is still possible when the worker crashes, runs out of
> memory, or has other legitimate GC or RPC trouble, but this change prevents
> unintentional failures when the worker is actually still healthy.
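As a usage sketch of the command-line-configurable interval described above
(the option name and default below are illustrative only, not the key the
patch actually adds):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class ProgressIntervalConfig {
  /** Illustrative key; the real patch defines its own option name. */
  public static final String PROGRESS_INTERVAL_KEY = "giraph.progressIntervalMsecs";
  /** Illustrative default. */
  public static final long DEFAULT_PROGRESS_INTERVAL_MSECS = 15 * 1000;

  /**
   * Read the interval between context.progress() calls, which can be
   * overridden on the command line, e.g. -Dgiraph.progressIntervalMsecs=30000.
   */
  public static long getProgressIntervalMsecs(Configuration conf) {
    return conf.getLong(PROGRESS_INTERVAL_KEY, DEFAULT_PROGRESS_INTERVAL_MSECS);
  }
}
{code}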
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira