[ 
https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437114#comment-13437114
 ] 

Eli Reisman commented on GIRAPH-246:
------------------------------------

I think I have traced the issues to something internal that was going on while 
I was doing the runs earlier. In repeated re-tests, both our rebase (which has 
worked for a month) and 246-NEW-FIX-2 (which keeps all the predicate lock code 
from Jaeho's fix, as described earlier) work as normal again. Given all the 
weirdness on the cluster the last few days, let's wait until Monday before 
doing anything with these. I will run many more jobs over the weekend to make 
absolutely certain that whatever was going on is internal and not a problem 
with the patches. I should have known when our rebase failed today that 
something odder than Giraph problems was happening, but let me make absolutely 
certain today and this weekend before going any further.

Sorry about all the comments here. I was under the impression this might get 
committed today after the earlier testing success, and I wanted to make sure 
that didn't happen until I had tracked this problem down, because when I got 
here today neither patch was behaving properly any more.

Jaeho, you mentioned wanting to try the extra thread option. Since I'll be 
running these two patches a lot this weekend, feel free to put some code for 
that up here or under your other post on this issue, and I'd be happy to try 
it out for you.

                
> Periodic worker calls to context.progress() will prevent timeout on some 
> Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-246
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-246
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: hadoop, patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-246-10.patch, GIRAPH-246-11.patch, 
> GIRAPH-246-1.patch, GIRAPH-246-2.patch, GIRAPH-246-3.patch, 
> GIRAPH-246-4.patch, GIRAPH-246-5.patch, GIRAPH-246-6.patch, 
> GIRAPH-246-7.patch, GIRAPH-246-7_rebase1.patch, GIRAPH-246-7_rebase2.patch, 
> GIRAPH-246-8.patch, GIRAPH-246-9.patch, GIRAPH-246-NEW-FIX-2.patch, 
> GIRAPH-246-NEW-FIX.patch
>
>
> This simple change creates a command-line configurable option in GiraphJob to 
> control the time between calls to context.progress(). This allows workers to 
> avoid timeouts during long data load-ins in which some workers complete their 
> input split reads much faster than others, or finish a superstep faster. I 
> found this allowed jobs that were large-scale but with low memory overhead to 
> complete even when they would previously time out during runs on a Hadoop 
> cluster. A timeout is still possible, and legitimate, when the worker crashes, 
> runs out of memory, or has other GC or RPC trouble, but the change prevents 
> unintentional crashes when the worker is actually still healthy.
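
For readers following the issue, the idea in the description above can be 
sketched roughly as follows. This is not the code from any of the attached 
patches. ProgressReporter is a hypothetical stand-in for the Hadoop task 
context (only progress() is modeled), BarrierWait and awaitWithProgress are 
made-up names, and intervalMillis stands for the configurable interval, whose 
actual option name is not given in this message. A CountDownLatch plays the 
role of the barrier.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the Hadoop task context; only progress() is modeled. */
interface ProgressReporter {
    void progress();
}

final class BarrierWait {
    private BarrierWait() {}

    /**
     * Blocks until the barrier opens. Each time intervalMillis elapses
     * without the barrier opening, progress() is called once so the
     * framework does not treat the waiting worker as hung.
     *
     * @return the number of progress() calls made while waiting
     */
    static int awaitWithProgress(CountDownLatch barrier,
                                 ProgressReporter reporter,
                                 long intervalMillis) {
        int calls = 0;
        try {
            // await(timeout) returns false when the interval elapses first.
            while (!barrier.await(intervalMillis, TimeUnit.MILLISECONDS)) {
                reporter.progress();
                calls++;
            }
        } catch (InterruptedException e) {
            // Restore the interrupt flag and surface the failure unchecked.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting on barrier", e);
        }
        return calls;
    }
}
```

In this sketch a worker that is blocked but healthy keeps reporting progress, 
while a worker that has crashed or is stuck outside the wait loop stops 
reporting and can still be timed out. The interval would need to be shorter 
than the cluster's task timeout for this to help.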

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

