[jira] [Updated] (GIRAPH-246) Periodic worker calls to context.progress() will prevent timeout on some Hadoop clusters during barrier waits

Eli Reisman (JIRA) Tue, 07 Aug 2012 23:04:16 -0700

     [ https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Eli Reisman updated GIRAPH-246:
-------------------------------

    Attachment: GIRAPH-246-9.patch

I want to make sure the progress call is made directly in the lock calls, not 
just in waitForever(), but outside the try blocks. This should work. I'm still 
trying to get time on the cluster to really ramp it up, but so far so good.

From what Jakob suspected, progress calls via Progressable weren't working. I 
also think, after looking at it, that we needed a few extra calls where my old 
patch had them, plus a call within the timed lock call (but not in the lock) 
as well as in waitForever().
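
To illustrate the call placement described here, below is a minimal Java 
sketch. It is not the attached patch: the class ProgressingEvent, the 
ProgressReporter interface (a local stand-in for Hadoop's Progressable), and 
the intervalMsecs parameter are all hypothetical names chosen for this example. 
It only shows progress being reported before the lock is taken in the timed 
wait, and once per loop in the wait-forever path.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative only: a hypothetical event class, not Giraph's actual code. */
class ProgressingEvent {
  /** Hypothetical local stand-in for Hadoop's Progressable interface. */
  interface ProgressReporter {
    void progress();
  }

  private final Lock lock = new ReentrantLock();
  private final Condition signaled = lock.newCondition();
  private final ProgressReporter reporter;
  private final long intervalMsecs; // assumed configurable reporting interval
  private boolean eventOccurred = false;

  ProgressingEvent(ProgressReporter reporter, long intervalMsecs) {
    this.reporter = reporter;
    this.intervalMsecs = intervalMsecs;
  }

  /** Mark the event as having occurred and wake all waiters. */
  void signal() {
    lock.lock();
    try {
      eventOccurred = true;
      signaled.signalAll();
    } finally {
      lock.unlock();
    }
  }

  /** Timed wait: returns true if the event occurred within msecs. */
  boolean waitMsecs(long msecs) {
    // Report progress in the timed call, but before taking the lock
    // and outside the try block.
    reporter.progress();
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(msecs);
    lock.lock();
    try {
      while (!eventOccurred) {
        long remainingNanos = deadline - System.nanoTime();
        if (remainingNanos <= 0) {
          return false;
        }
        signaled.awaitNanos(remainingNanos);
      }
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    } finally {
      lock.unlock();
    }
  }

  /** Wait indefinitely, reporting progress once per interval. */
  void waitForever() {
    while (true) {
      reporter.progress();
      if (waitMsecs(intervalMsecs)) {
        return;
      }
    }
  }
}
```

In this sketch a worker blocked at a barrier would call waitForever(), so the 
reporter is invoked at least once every intervalMsecs for as long as the wait 
lasts.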

I guess the progress calls have to rack up for a while before the real IPC call 
to Hadoop is made, so calling it more frequently than looks wise is apparently 
not a bad thing. I'm not sure this patch works yet; the only one I've gotten to 
work for sure is the one I put up earlier. I'm trying to find a marriage of the 
two that will let my jobs finish and keep the context in the Predicate lock as 
Avery asked. So far this one is looking good, but I'll confirm tomorrow AM if I 
get some good longer runs in tonight.


                
> Periodic worker calls to context.progress() will prevent timeout on some 
> Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-246
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-246
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: hadoop, patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-246-1.patch, GIRAPH-246-2.patch, 
> GIRAPH-246-3.patch, GIRAPH-246-4.patch, GIRAPH-246-5.patch, 
> GIRAPH-246-6.patch, GIRAPH-246-7.patch, GIRAPH-246-8.patch, GIRAPH-246-9.patch
>
>
> This simple change creates a command-line configurable option in GiraphJob to 
> control the time between calls to context().progress(). This allows workers 
> to avoid timeouts during long data load-ins in which some workers complete 
> their input split reads much faster than others, or finish a superstep 
> faster. I found this allowed jobs that were large-scale but with low memory 
> overhead to complete even when they would previously time out during runs on 
> a Hadoop cluster. A timeout is still possible when a worker crashes, runs out 
> of memory, or has other legitimate GC or RPC trouble, but the change prevents 
> unintentional crashes when the worker is actually still healthy.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
