[ https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436866#comment-13436866 ]
Eli Reisman commented on GIRAPH-246:
------------------------------------
More testing this morning. The 246-NEW-FIX-2.patch calls progress() every 10
seconds regardless of the variable-length timed waits in waitMsecs() that this
patch sets up, or in waitForever() as Jaeho set it up to do in GIRAPH-267 and
as trunk already does. I think this is ready to go.
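For anyone following along, the idea boils down to something like the sketch
below: a timed wait that wakes up periodically to report progress. This is
only an illustration of the technique; the class, method, and constant names
are mine, not the actual patch code.

{code:java}
import org.apache.hadoop.util.Progressable;

/**
 * Minimal sketch: a timed wait that wakes up periodically to call
 * progress(), so the TaskTracker never sees a silent worker.
 * Names here are illustrative, not the patch's.
 */
public class ProgressWaitSketch {
  /** How often to report progress while waiting, in milliseconds. */
  private static final long PROGRESS_INTERVAL_MSECS = 10 * 1000L;

  /**
   * Wait on lock for up to msecs milliseconds, calling progress()
   * every PROGRESS_INTERVAL_MSECS. A real implementation would also
   * re-check whatever condition the caller is actually waiting on.
   */
  public static void waitMsecs(Object lock, long msecs,
      Progressable progressable) throws InterruptedException {
    long deadline = System.currentTimeMillis() + msecs;
    synchronized (lock) {
      while (true) {
        long remaining = deadline - System.currentTimeMillis();
        if (remaining <= 0) {
          return;
        }
        lock.wait(Math.min(remaining, PROGRESS_INTERVAL_MSECS));
        progressable.progress();  // liveness ping to the TaskTracker
      }
    }
  }
}
{code}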
In other news, while stress testing this and scaling it up, I think I might
have found another place progress() needs to be called more often: in the
Netty channel pipelines handling send and receive during the input superstep,
as collections of vertices are sent to their future homes. I will try to get
more instrumented runs in this morning to collect more details, but something
weird is going on when a worker is not reading a split but does start to
receive its partition data over Netty, and it is causing a timeout. I don't
know if the timing is coincidental, but a strange timeout during large-scale
runs is happening consistently on such worker nodes. Often, when I can get log
data on such a timeout, it is not a healthy worker timing out but one where
Netty is overwhelmed and the worker has genuinely died. This might be more
appropriate in another JIRA, or perhaps Avery is already aware of this and has
wrapped it up into his next Netty improvement? Either way, I will try to get
more details on what is happening here and reproduce the problem. This is
running on today's trunk, so the GIRAPH-300 improvements were already in when
this problem showed up. If progress() really does need to be called from the
pipeline, see the rough sketch below for what I have in mind.
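Purely a sketch (Netty 3 style, since that is what we build against today);
the handler and field names are mine and this is not the actual Giraph
pipeline code:

{code:java}
import org.apache.hadoop.util.Progressable;
import org.jboss.netty.channel.ChannelHandlerContext;
import org.jboss.netty.channel.MessageEvent;
import org.jboss.netty.channel.SimpleChannelUpstreamHandler;

/**
 * Sketch only: an upstream handler that reports progress as partition
 * data arrives during the input superstep. Names are illustrative.
 */
public class ProgressReportingHandler extends SimpleChannelUpstreamHandler {
  private final Progressable progressable;

  public ProgressReportingHandler(Progressable progressable) {
    this.progressable = progressable;
  }

  @Override
  public void messageReceived(ChannelHandlerContext ctx, MessageEvent e)
      throws Exception {
    // Each received chunk of vertex data counts as liveness.
    progressable.progress();
    super.messageReceived(ctx, e);  // pass the event along the pipeline
  }
}
{code}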
> Periodic worker calls to context.progress() will prevent timeout on some
> Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
> Key: GIRAPH-246
> URL: https://issues.apache.org/jira/browse/GIRAPH-246
> Project: Giraph
> Issue Type: Improvement
> Components: bsp
> Affects Versions: 0.2.0
> Reporter: Eli Reisman
> Assignee: Eli Reisman
> Priority: Minor
> Labels: hadoop, patch
> Fix For: 0.2.0
>
> Attachments: GIRAPH-246-10.patch, GIRAPH-246-11.patch,
> GIRAPH-246-1.patch, GIRAPH-246-2.patch, GIRAPH-246-3.patch,
> GIRAPH-246-4.patch, GIRAPH-246-5.patch, GIRAPH-246-6.patch,
> GIRAPH-246-7.patch, GIRAPH-246-7_rebase1.patch, GIRAPH-246-7_rebase2.patch,
> GIRAPH-246-8.patch, GIRAPH-246-9.patch, GIRAPH-246-NEW-FIX-2.patch,
> GIRAPH-246-NEW-FIX.patch
>
>
> This simple change creates a command-line configurable option in GiraphJob to
> control the time between calls to context.progress(), which allows workers to
> avoid timeouts during long data load-ins in which some workers complete their
> input split reads much faster than others, or finish a superstep faster. I
> found this allowed jobs that were large-scale but with low memory overhead to
> complete even when they would previously time out during runs on a Hadoop
> cluster. Timeout is still possible when the worker crashes, runs out of
> memory, or has other legitimate GC or RPC trouble, but this prevents
> unintentional crashes when the worker is actually still healthy.
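For context on the description above, the configurable knob amounts to
something like the following sketch. The property name and defaults shown are
illustrative assumptions, not necessarily what the patch actually adds:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Progressable;

/**
 * Sketch of reading a configurable progress interval and using it to
 * break a long sleep into progress-reporting chunks. The property name
 * here is hypothetical.
 */
public class ProgressIntervalSketch {
  /** Hypothetical property controlling time between progress() calls. */
  public static final String PROGRESS_INTERVAL_KEY =
      "giraph.progressIntervalMsecs";
  public static final int PROGRESS_INTERVAL_DEFAULT = 10 * 1000;

  public static void sleepWithProgress(Configuration conf,
      Progressable progressable, long totalMsecs)
      throws InterruptedException {
    int interval =
        conf.getInt(PROGRESS_INTERVAL_KEY, PROGRESS_INTERVAL_DEFAULT);
    long remaining = totalMsecs;
    while (remaining > 0) {
      long chunk = Math.min(remaining, interval);
      Thread.sleep(chunk);
      progressable.progress();  // report liveness to the TaskTracker
      remaining -= chunk;
    }
  }
}
{code}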