[
https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436179#comment-13436179
]
Eli Reisman commented on GIRAPH-246:
------------------------------------
I've done 3 very long runs on this patch now. The last one ran with several
more workers than there were input splits, to ensure that a select few workers
never at any point did anything but sit in waitForever() during a very long
data load-in (20+ min). Right at the 10 min. timeout, those workers died.
Sometimes when this sort of timeout happens during INPUT_SUPERSTEP, it's
because Netty is overloaded; more runs and checking lots of mapper detail
pages will turn up the log of a timed-out worker that shows exactly that. But
these were workers whose logs confirmed they never received a split at all:
they just sat in the barrier, receiving only the incoming Netty data for their
partitions. That should have been no big deal, since thousands of other
workers that did read splits and send/receive data at the same time had no
timeout or crash problems.
It seems, again, that the only way to ensure the barrier doesn't have progress
trouble is to call progress() outside of the locked state in PredicateLock.
Let me try that and see if we get the results we need, and re-run the current
test to see whether this happens again. I hope it was a fluke; this needs to
be over with!
> Periodic worker calls to context.progress() will prevent timeout on some
> Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
> Key: GIRAPH-246
> URL: https://issues.apache.org/jira/browse/GIRAPH-246
> Project: Giraph
> Issue Type: Improvement
> Components: bsp
> Affects Versions: 0.2.0
> Reporter: Eli Reisman
> Assignee: Eli Reisman
> Priority: Minor
> Labels: hadoop, patch
> Fix For: 0.2.0
>
> Attachments: GIRAPH-246-10.patch, GIRAPH-246-11.patch,
> GIRAPH-246-1.patch, GIRAPH-246-2.patch, GIRAPH-246-3.patch,
> GIRAPH-246-4.patch, GIRAPH-246-5.patch, GIRAPH-246-6.patch,
> GIRAPH-246-7.patch, GIRAPH-246-7_rebase1.patch, GIRAPH-246-8.patch,
> GIRAPH-246-9.patch, GIRAPH-246-NEW-FIX.patch
>
>
> This simple change creates a command-line configurable option in GiraphJob to
> control the time between calls to context.progress(), allowing workers to
> avoid timeouts during long data load-ins in which some workers complete their
> input split reads much faster than others, or finish a superstep faster. I
> found this allowed jobs that were large-scale but with low memory overhead to
> complete even when they would previously time out during runs on a Hadoop
> cluster. Timeout is still possible when a worker crashes, runs out of memory,
> or has other legitimate GC or RPC trouble, but this change prevents
> unintentional timeouts when the worker is actually still healthy.
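The quoted description, a periodic progress call at a configurable interval, can be sketched as below. This is an assumption-laden illustration, not Giraph's actual implementation: the class name and the injected Runnable (standing in for context.progress()) are hypothetical, and the interval would come from the command-line option mentioned above rather than a hard-coded value.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Sketch: periodic progress reporter with a configurable interval. */
class ProgressReporter implements AutoCloseable {
    private final ScheduledExecutorService exec =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "progress-reporter");
            t.setDaemon(true);  // never keep the JVM alive on its own
            return t;
        });

    /** Fire the progress callback every intervalMsecs milliseconds. */
    ProgressReporter(long intervalMsecs, Runnable progress) {
        exec.scheduleAtFixedRate(
            progress, intervalMsecs, intervalMsecs, TimeUnit.MILLISECONDS);
    }

    @Override
    public void close() {
        exec.shutdownNow();
    }
}
```

Each firing would call the Hadoop task context's progress(), resetting the cluster's task timeout so a healthy but idle worker is not killed mid-barrier.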
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira