[
https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eli Reisman updated GIRAPH-246:
-------------------------------
Attachment: GIRAPH-246-7_rebase1.patch
This patch is a rebasing of the "revert 267" code; it's not meant for
inclusion, as the general consensus is that we want to keep the PredicateLock
code in the codebase once we get a final fix for this. I am uploading it for
convenience since people here needed a rebase. I still have no idea why this
one seems to work for us.
In general, as our ops guy noticed within about 3 minutes of watching a run
(that later timed out :( ), for one reason or another and regardless of
approach, Giraph is clearly not sending enough progress messages to Hadoop
despite a fair number of calls.
This observation was made under trunk with the currently implemented
PredicateLock calls, but as Avery mentioned this should not matter. I did
mention a few theories on the GIRAPH-274 thread regarding this. Chief among
them: my understanding is that when you call context.progress(), Hadoop only
records the call locally, and actually reports out to the framework after some
fixed number of calls have accumulated. Given this, perhaps we should have
PredicateLock's waitMsecs() and waitForever() both break out of their wait
loops every so often and re-call progress(), so that within a given 10-minute
window (the Hadoop timeout default) we get enough calls in to force a "real"
call on the wire to Hadoop.
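To make the idea concrete, here is a minimal sketch of what such a
periodically-waking wait could look like. This is not the actual PredicateLock
code; the Progressable stand-in, the class name, and the interval parameter are
all hypothetical, and in Giraph the real callback would be the Mapper.Context
handed to the worker.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for Hadoop's progress callback; in Giraph this
// would be the Mapper.Context passed into the BSP worker.
interface Progressable {
    void progress();
}

/** Sketch of a barrier wait that wakes periodically to report progress. */
class ProgressAwareWait {
    private final Lock lock = new ReentrantLock();
    private final Condition cond = lock.newCondition();
    private boolean eventOccurred = false;

    /** Called by whoever releases the barrier. */
    public void signal() {
        lock.lock();
        try {
            eventOccurred = true;
            cond.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /**
     * Wait until signaled, but wake up every progressIntervalMsecs so we
     * can call progress() and keep Hadoop's task timeout from firing.
     */
    public void waitForever(Progressable context, long progressIntervalMsecs)
            throws InterruptedException {
        lock.lock();
        try {
            while (!eventOccurred) {
                // await() returns after the timeout even without a signal,
                // giving us a chance to report progress to Hadoop.
                cond.await(progressIntervalMsecs, TimeUnit.MILLISECONDS);
                context.progress();
            }
        } finally {
            lock.unlock();
        }
    }
}
```

The interval would need to be small enough that, even with Hadoop batching
several progress() calls into one real report, at least one report goes out
per 10-minute timeout window.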
I have had a lot of trouble finding a clear window to run any of these
candidates, as our clusters have been piled up this last week, but I look
forward to trying more of them and hopefully finding one that retains the
PredicateLock code and also does not time out.
If I make any progress, I'll throw another patch on the pile here. Ideas
welcome on this one. Calling all Hadoop MR specialists...
> Periodic worker calls to context.progress() will prevent timeout on some
> Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
> Key: GIRAPH-246
> URL: https://issues.apache.org/jira/browse/GIRAPH-246
> Project: Giraph
> Issue Type: Improvement
> Components: bsp
> Affects Versions: 0.2.0
> Reporter: Eli Reisman
> Assignee: Eli Reisman
> Priority: Minor
> Labels: hadoop, patch
> Fix For: 0.2.0
>
> Attachments: GIRAPH-246-10.patch, GIRAPH-246-11.patch,
> GIRAPH-246-1.patch, GIRAPH-246-2.patch, GIRAPH-246-3.patch,
> GIRAPH-246-4.patch, GIRAPH-246-5.patch, GIRAPH-246-6.patch,
> GIRAPH-246-7.patch, GIRAPH-246-7_rebase1.patch, GIRAPH-246-8.patch,
> GIRAPH-246-9.patch
>
>
> This simple change creates a command-line configurable option in GiraphJob to
> control the time between calls to context.progress(), which allows workers to
> avoid timeouts during long data load-ins in which some workers complete their
> input split reads much faster than others, or finish a superstep faster. I
> found this allowed jobs that were large-scale but with low memory overhead to
> complete even when they would previously time out during runs on a Hadoop
> cluster. Timeout is still possible when the worker crashes, runs out of
> memory, or has other legitimate GC or RPC trouble, but this prevents
> unintentional failures when the worker is actually still healthy.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira