[ https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Reisman updated GIRAPH-246:
-------------------------------

    Attachment: GIRAPH-246-NEW-FIX.patch

This is working for us. My ops fella still says we need to call progress() more 
often, but I have tested this several times now and it gets past our timeout 
filters. To make sure I wasn't crazy, I ran trunk again after testing this, and 
it timed out again at 10 minutes like always. The only major change here is:

PredicateLock#waitMsecs() never called progress() during its waits; only 
waitForever() did, by timing out every MSEC_PERIOD. So barrier waits that go 
through waitMsecs() (such as during INPUT_SUPERSTEP, which was often the 
failure point for us) never reported progress from idle workers at the barrier.

This patch makes waitMsecs() call condition.await() in fixed-size slices of the 
requested timeout, calling progress() after each slice and continuing to wait 
until either the event occurs or the total timeout is used up.

This "fix" makes the call from within the lock state which I don't like, but it 
works, perhaps because:

1. No one else is calling progress() during these barrier waits (or there would 
be no problem here to begin with), so there is no thread contention around the 
calls.

2. Calls like progress() are effectively idempotent: nothing depends on how the 
calls interleave, only that they happen at all.

3. The locking is there to protect against asynchronous changes to the event 
state, which is not idempotent, so we shouldn't exit the try block to call 
progress() every time the timed condition.await() returns.

If we want to break out of the try block to call progress() on each loop 
iteration we can, but this approach already works, keeps the event handling 
safely inside the locked code at all times, and makes a minimal change to an 
important part of the code that already works correctly.

I was not able to apply 291 as it is stale, so I wrote a simple test; when 291 
is ready, it can replace this code with more comprehensive tests if you like.

If someone else could try this out it would make me happy, but I think it's 
good to go: it passes mvn verify as well as its new test.


                
> Periodic worker calls to context.progress() will prevent timeout on some 
> Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-246
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-246
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: hadoop, patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-246-10.patch, GIRAPH-246-11.patch, 
> GIRAPH-246-1.patch, GIRAPH-246-2.patch, GIRAPH-246-3.patch, 
> GIRAPH-246-4.patch, GIRAPH-246-5.patch, GIRAPH-246-6.patch, 
> GIRAPH-246-7.patch, GIRAPH-246-7_rebase1.patch, GIRAPH-246-8.patch, 
> GIRAPH-246-9.patch, GIRAPH-246-NEW-FIX.patch
>
>
> This simple change adds a command-line configurable option in GiraphJob to 
> control the time between calls to context.progress(), which allows workers to 
> avoid timeouts during long data load-ins in which some workers complete their 
> input split reads much faster than others, or finish a superstep faster. I 
> found this allowed jobs that were large-scale but with low memory overhead to 
> complete even when they would previously time out during runs on a Hadoop 
> cluster. Timeouts are still possible when the worker crashes, runs out of 
> memory, or has legitimate GC or RPC trouble, but this prevents unintentional 
> failures when the worker is actually still healthy.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
