[jira] [Commented] (GIRAPH-246) Periodic worker calls to context.progress() will prevent timeout on some Hadoop clusters during barrier waits

Eli Reisman (JIRA) Tue, 07 Aug 2012 23:23:16 -0700

    [ https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430897#comment-13430897 ]


Eli Reisman commented on GIRAPH-246:
------------------------------------

FYI Jaeho, you did great work, and I hope you don't think I'm picking on it. My 
reasons for reverting it come from desperation:

I had a fix for this that worked, so I could test other code knowing the 
timeout was not a problem on large data loads (which is my whole focus right 
now). Once the 267 patch went in, I could not move forward until it was fixed. 
I was able to move forward again last weekend, when I rebased my old patch and 
the problem went away. I have a bunch of new code to test and can't keep 
fighting the old problem. With the old patch in and the new one reverted, I 
can keep using it without having to rebase it every time trunk is updated or I 
want to patch in new code to test. I figured that with the pressure off you to 
solve this quickly, I could run new code on large data knowing it won't time 
out, and you would have time to get the fix you want up and running and then 
happily replace my ugly one once you know what's wrong.

I'm hoping I can get this compromise patch to run tonight, but my cluster is 
busy and I'm having trouble getting scheduled for even a small job. I will keep 
trying and hopefully get an answer. If this works, we can keep the predicate 
lock idea and I can still get my jobs to run without timing out. That's all 
I'm looking for at this point.


                
> Periodic worker calls to context.progress() will prevent timeout on some 
> Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-246
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-246
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: hadoop, patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-246-1.patch, GIRAPH-246-2.patch, 
> GIRAPH-246-3.patch, GIRAPH-246-4.patch, GIRAPH-246-5.patch, 
> GIRAPH-246-6.patch, GIRAPH-246-7.patch, GIRAPH-246-8.patch, GIRAPH-246-9.patch
>
>
> This simple change creates a command-line configurable option in GiraphJob to 
> control the time between calls to context.progress(). The calls allow workers 
> to avoid timeouts during long data load-ins in which some workers complete 
> their input split reads much faster than others, or finish a superstep faster. 
> I found this allowed jobs that were large-scale but had low memory overhead to 
> complete even when they would previously time out during runs on a Hadoop 
> cluster. A timeout is still possible when the worker crashes, runs out of 
> memory, or has other legitimate GC or RPC trouble, but the change prevents 
> unintentional crashes when the worker is actually still healthy.
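
The idea in the description above can be sketched roughly as follows. This is not the code from any of the attached patches: the names ProgressReporter, waitAtBarrier and intervalMs are made up for illustration, a CountDownLatch stands in for the real barrier, and the reporter stands in for whatever would call context.progress() in a real job. The worker waits on the barrier in short slices and reports progress each time a slice expires without the barrier being released.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BarrierWaitSketch {

    /** Hypothetical stand-in for the object whose progress() call resets the task timeout. */
    interface ProgressReporter {
        void progress();
    }

    /**
     * Waits until the barrier is released, calling reporter.progress() every
     * intervalMs milliseconds while still waiting. Returns the number of
     * progress calls made, which is only useful for illustration and testing.
     */
    static int waitAtBarrier(CountDownLatch barrier, long intervalMs,
                             ProgressReporter reporter) throws InterruptedException {
        int calls = 0;
        // await() returns false when the wait slice expires before the release.
        while (!barrier.await(intervalMs, TimeUnit.MILLISECONDS)) {
            reporter.progress();
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch barrier = new CountDownLatch(1);
        AtomicInteger heartbeats = new AtomicInteger();

        // Simulates a slower worker that reaches the barrier later.
        Thread slowWorker = new Thread(() -> {
            try {
                Thread.sleep(300);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            barrier.countDown();
        });
        slowWorker.start();

        int calls = waitAtBarrier(barrier, 50, heartbeats::incrementAndGet);
        slowWorker.join();
        System.out.println("progress calls while waiting: " + calls);
    }
}
```

In this sketch a worker that is blocked or dead never reaches the reporter call, so a real failure would still be caught by the timeout, while a healthy worker that is merely waiting keeps reporting.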

