[ https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436974#comment-13436974 ]

Eli Reisman commented on GIRAPH-246:
------------------------------------

I have run several more large-scale tests and can now confirm that something 
else is wrong. I still believe 246-NEW-FIX-2.patch keeps the calls inside the 
predicate lock as Avery prefers and performs as well as the old patch did. It 
solves several of the easy timeout cases and should probably go in so we can 
get past the current situation, in which runs of various sizes are failing at 
numerous points.
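
For reference, the pattern that 246-NEW-FIX-2.patch keeps is roughly the 
following: a timed wait on the lock's condition, with a progress() call on 
every wake-up. This is a simplified sketch, not the actual Giraph predicate 
lock code, and the class name and wait interval are illustrative only:

    // Sketch: wait on a predicate inside a lock, reporting progress between
    // timed waits so the task tracker keeps seeing the worker as alive.
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.locks.Condition;
    import java.util.concurrent.locks.Lock;
    import java.util.concurrent.locks.ReentrantLock;
    import org.apache.hadoop.util.Progressable;

    class ProgressingPredicateLock {
      private final Lock lock = new ReentrantLock();
      private final Condition cond = lock.newCondition();
      private boolean eventOccurred = false;

      /** Wake all waiters once the event has happened. */
      void signal() {
        lock.lock();
        try {
          eventOccurred = true;
          cond.signalAll();
        } finally {
          lock.unlock();
        }
      }

      /** Wait for the event, calling progress() on every timed wake-up. */
      void waitForever(Progressable context) throws InterruptedException {
        lock.lock();
        try {
          while (!eventOccurred) {
            cond.await(15, TimeUnit.SECONDS);
            context.progress();   // liveness report from inside the lock
          }
        } finally {
          lock.unlock();
        }
      }
    }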

The bad news: there is a new problem that is affecting all the patches, even 
the rebases. If a worker does not read an input split and passes 600 seconds 
of "idle time" during INPUT_SUPERSTEP, it will time out. The original patch 
solved this issue here for a month or more, but now none of the patches do.
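
For context, 600 seconds matches what I believe is Hadoop's stock task 
timeout (mapred.task.timeout, 600000 ms), i.e. the window a task has to call 
progress() before the task tracker kills it. A trivial check, assuming that 
property name:

    // Print the task timeout this cluster would apply (600000 ms by default).
    import org.apache.hadoop.conf.Configuration;

    public class TaskTimeoutCheck {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        System.out.println("mapred.task.timeout (ms): "
            + conf.getLong("mapred.task.timeout", 600000L));
      }
    }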

A lot of new code has been committed recently, and perhaps something in it 
has had this side effect. Whether the progress calls come from inside the 
locking code or from BspServiceWorker, the failure occurs. The length of the 
timeout during the INPUT_SUPERSTEP barrier wait does not affect the situation 
either. Something else is going on here now. I have instrumented the code and 
can see that the idle workers ARE waking up from the barrier and scanning the 
split list again before going back to sleep (though not far enough to claim a 
split, which GIRAPH-301 is attempting to address), but if they do not end up 
executing split-reading code before the timeout arrives, these periodic 
wake-ups are not enough to avoid timing out. This is independent of how many 
progress() calls occur during those wake-ups.
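
To make the instrumented behavior concrete, the idle-worker loop during 
INPUT_SUPERSTEP looks roughly like this. It is a simplified sketch, not the 
actual BspServiceWorker code, and all of the helper names are stand-ins:

    // Sketch: an idle worker wakes periodically, rescans the split list,
    // reports progress, and sleeps again if it finds nothing to read.
    import org.apache.hadoop.util.Progressable;

    abstract class IdleWorkerSketch {
      abstract boolean allSplitsDone();          // global input-split bookkeeping
      abstract String reserveInputSplit();       // rescan the split list
      abstract void loadVertices(String splitPath);
      abstract void waitMsecs(long msecs);       // timed wait on the barrier event

      void inputSuperstepLoop(Progressable context) {
        while (!allSplitsDone()) {
          String splitPath = reserveInputSplit();
          if (splitPath != null) {
            loadVertices(splitPath);   // split-reading code: the case that survives
          } else {
            context.progress();        // called on every wake-up, yet we still die
            waitMsecs(60 * 1000L);     // back to sleep until the next scan
          }
        }
      }
    }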

Because the Netty channel pipeline runs in its own thread pool, the lack of 
progress() calls in the Netty code should not make any difference as long as 
BspServiceWorker is calling progress() often enough. It now seems to me that 
the progress calls from inside BspServiceWorker are not getting through. I 
notice the master has no problem with barrier waits longer than the ones 
these idle workers experience, even when it is effectively idle and waiting.
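
In other words, the threading model I am assuming looks like the sketch below 
(illustrative only, not actual Giraph code): network I/O runs on its own pool 
and never needs to call progress(), while the mapper's main thread does:

    // Sketch: Netty-style I/O on a separate pool; only the mapper thread
    // reports progress, which should be enough to reset the task timeout.
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.hadoop.util.Progressable;

    class ThreadingSketch {
      private final ExecutorService ioPool = Executors.newFixedThreadPool(4);

      void barrierWait(Progressable context) throws InterruptedException {
        ioPool.submit(new Runnable() {
          @Override public void run() {
            // channel pipeline work happens here, with no progress() calls
          }
        });
        while (!barrierReached()) {
          context.progress();      // this alone should keep the task alive
          Thread.sleep(30 * 1000L);
        }
      }

      private boolean barrierReached() {
        return true;               // stub so the sketch compiles
      }
    }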

This whole issue has defied simple explanations for a long time now, and I 
was just happy to have a fix that seemed to solve it. But at this point I 
would really like to get to the bottom of it, because it will once again 
directly block us here. I realize others who have seen this issue are hitting 
problems at other phases of the job workflow, but if anyone has ideas I'd 
like to hear them. Having every worker waste a thread calling progress() all 
the time seems like the wrong direction, since we share our grid with many 
concurrent jobs and threads are at a premium here. There is probably a simple 
reason this problem has cropped up. Manually raising the Hadoop timeouts is 
also not an option for us. What we really need is a Giraph-internal solution 
that lets bad jobs die and keeps good ones alive.
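
For the record, the approach I would rather avoid looks roughly like the 
sketch below: one extra thread per worker whose only job is to call 
progress() on a schedule. The class name and interval are illustrative:

    // Sketch: a dedicated keep-alive thread. It works, but it burns a thread
    // per worker, which is costly on a shared grid.
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import org.apache.hadoop.util.Progressable;

    class ProgressKeepAlive {
      private final ScheduledExecutorService scheduler =
          Executors.newSingleThreadScheduledExecutor();

      void start(final Progressable context) {
        scheduler.scheduleAtFixedRate(new Runnable() {
          @Override public void run() {
            context.progress();    // unconditional liveness, healthy or not
          }
        }, 30, 30, TimeUnit.SECONDS);
      }

      void stop() {
        scheduler.shutdownNow();
      }
    }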

If anyone has further input on this topic, please feel free to jump in; I'd 
like to get this solved for all cases and move forward!

> Periodic worker calls to context.progress() will prevent timeout on some 
> Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-246
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-246
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: hadoop, patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-246-10.patch, GIRAPH-246-11.patch, 
> GIRAPH-246-1.patch, GIRAPH-246-2.patch, GIRAPH-246-3.patch, 
> GIRAPH-246-4.patch, GIRAPH-246-5.patch, GIRAPH-246-6.patch, 
> GIRAPH-246-7.patch, GIRAPH-246-7_rebase1.patch, GIRAPH-246-7_rebase2.patch, 
> GIRAPH-246-8.patch, GIRAPH-246-9.patch, GIRAPH-246-NEW-FIX-2.patch, 
> GIRAPH-246-NEW-FIX.patch
>
>
> This simple change creates a command-line configurable option in GiraphJob to 
> control the time between calls to context().progress(), allowing workers to 
> avoid timeouts during long data load-ins in which some workers complete their 
> input split reads much faster than others, or finish a superstep faster. I 
> found this allowed large-scale jobs with low memory overhead to complete even 
> when they would previously time out during runs on a Hadoop cluster. A timeout 
> is still possible when a worker crashes, runs out of memory, or has other 
> legitimate GC or RPC trouble, but this change prevents unintentional failures 
> when the worker is actually still healthy.

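The option described above would presumably be consumed on the worker side 
roughly as in the sketch below. The property name "giraph.progressInterval" 
is only a placeholder here, not the actual key the patch adds:

    // Sketch: sleep in configurable slices, calling progress() after each one.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.Progressable;

    class ProgressIntervalSketch {
      static void waitWithProgress(Configuration conf, Progressable context,
                                   long totalWaitMsecs) throws InterruptedException {
        long intervalMsecs = conf.getLong("giraph.progressInterval", 30 * 1000L);
        long waited = 0;
        while (waited < totalWaitMsecs) {
          long slice = Math.min(intervalMsecs, totalWaitMsecs - waited);
          Thread.sleep(slice);
          waited += slice;
          context.progress();   // keep the task tracker from timing us out
        }
      }
    }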