[ https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437621#comment-13437621 ]

Eli Reisman commented on GIRAPH-246:
------------------------------------

OK. I have tested GIRAPH-246-NEW-FIX-2.patch quite a bit over the weekend, and 
instrumented runs show that it solves our problems as well as the old rebase 
did. All the issues I posted here on Friday turned out to be memory overloads 
in the worker or in Netty itself that were not emitting logs, so I could not be 
sure of the causes. As far as I'm concerned, this patch is good to go as of 
now.

You know, I have had the same problems with zookeeper timeouts; you might 
consider trying -Dzookeeper.maxSessionTimeout=XXX (where XXX is a large number 
of milliseconds) if you see more crashes during long jobs. I am very interested 
to see how the runs go for you, since your hardware profile, and therefore your 
scaling strategy, is different from ours. Let me know how it goes.

I would still say it's worth updating GIRAPH-274 and committing it as-is soon 
if you have trouble with the cleanup phase, as it will complement this one 
nicely and cover all the timeout pain spots discovered so far.

Have the command-line timeout options worked for you there? They are an 
attractive option, but we have been forbidden from using them here because 
they lead to zombie jobs when folks forget a run and go home, and ops gets 
mad. We can't kill each other's jobs on our cluster, so it's a real concern 
when one outlives its usefulness for many hours! The ideal is a self-contained 
solution that doesn't force users to deal with all this stuff external to 
Giraph. If we can handle the timeout issues in code it would be awfully nice, 
as the health heartbeats do serve a valuable purpose and Giraph should play 
well on a busy grid such as ours, where jobs using Hadoop MR, Pig, Crunch, 
Hive, etc. must all work together simultaneously.
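For what it's worth, the in-code approach can be as simple as a background 
thread that keeps reporting progress while a worker sits at a barrier. Here is 
a rough sketch of that general idea (not the actual patch); it is written 
against Hadoop's generic Progressable interface, and the interval would come 
from a configurable option like the one this issue adds:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    import org.apache.hadoop.util.Progressable;

    /** Sketch only: keeps a blocked task alive by reporting progress periodically. */
    public class ProgressPinger {
      private final ScheduledExecutorService scheduler =
          Executors.newSingleThreadScheduledExecutor();

      public ProgressPinger(final Progressable context, long intervalMsec) {
        scheduler.scheduleAtFixedRate(new Runnable() {
          @Override
          public void run() {
            // Tell the task tracker the worker is still alive even though it
            // is blocked on a barrier or a slow input split.
            context.progress();
          }
        }, intervalMsec, intervalMsec, TimeUnit.MILLISECONDS);
      }

      public void stop() {
        scheduler.shutdownNow();
      }
    }

A worker would start one of these before a long wait and stop it afterwards, 
so the heartbeats keep their meaning: a crashed or wedged worker still times 
out, but a healthy one waiting at a barrier does not.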


                
> Periodic worker calls to context.progress() will prevent timeout on some 
> Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-246
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-246
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: hadoop, patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-246-10.patch, GIRAPH-246-11.patch, 
> GIRAPH-246-1.patch, GIRAPH-246-2.patch, GIRAPH-246-3.patch, 
> GIRAPH-246-4.patch, GIRAPH-246-5.patch, GIRAPH-246-6.patch, 
> GIRAPH-246-7.patch, GIRAPH-246-7_rebase1.patch, GIRAPH-246-7_rebase2.patch, 
> GIRAPH-246-8.patch, GIRAPH-246-9.patch, GIRAPH-246-NEW-FIX-2.patch, 
> GIRAPH-246-NEW-FIX.patch
>
>
> This simple change creates a command-line configurable option in GiraphJob to 
> control the time between calls to context.progress(), allowing workers to 
> avoid timeouts during long data load-ins in which some workers complete their 
> input split reads much faster than others, or finish a superstep faster. I 
> found this allowed jobs that were large-scale but with low memory overhead to 
> complete even when they would previously time out during runs on a Hadoop 
> cluster. Timeouts are still possible when the worker crashes, runs out of 
> memory, or has other legitimate GC or RPC trouble, but this prevents 
> unintentional failures when the worker is actually still healthy.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
