[ 
https://issues.apache.org/jira/browse/GIRAPH-274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426782#comment-13426782
 ] 

Eli Reisman commented on GIRAPH-274:
------------------------------------

Thanks for noticing! The patch I had up called progress() in more places than 
just around the locks. I have been running large amounts of data all summer, and 
it takes forever to load. I know the patch polluted the landscape with 
progress() calls, but the alternative was another thread, as Avery said here, 
and that seemed like a worse idea: it allowed zombies to keep running when they 
had failed for all intents and purposes. When users played with this idea, our 
cluster was occasionally littered with zombies that users had forgotten about 
once the job seemed to fail. So...

The patch I arrived at in 246 worked fine and only hit a 600 second timeout 
when the job had actually failed catastrophically at a particular worker. If 
you look through it and add the progress() calls your lock patch did not, it 
will work. I was able to spend 60+ min loading huge social graph data with no 
trouble, and the jobs finished. Obviously the next step is to lower that time, 
but progress() calls are a must. If you grab those calls, I guarantee it will 
work for now as long as you need it to. It's been a while, but I'm fairly sure 
I didn't give context access to anyone who didn't already have it.
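
To illustrate the pattern only (this is not code from the 246 patch), here is a 
minimal sketch of calling progress() from the loading loop itself. The class 
ProgressSketch, the method loadAll, and the interval parameter are hypothetical 
names, and the local Progressable interface is a stand-in for Hadoop's, declared 
here so the example compiles on its own. Because the calls come from the thread 
doing the work, they stop if that thread hangs, so the task timeout can still 
catch a worker that has really failed.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ProgressSketch {
  /** Local stand-in (assumption) for Hadoop's Progressable interface. */
  public interface Progressable {
    void progress();
  }

  /**
   * Consumes every record, calling progress() once per `interval` records
   * and once more after the last record. Returns the number of records read.
   */
  public static <T> int loadAll(Iterator<T> records, Progressable context,
      int interval) {
    int loaded = 0;
    while (records.hasNext()) {
      records.next(); // hypothetical: real code would build a vertex here
      loaded++;
      if (loaded % interval == 0) {
        context.progress(); // report liveness from the working thread
      }
    }
    context.progress(); // final report once the input is exhausted
    return loaded;
  }

  public static void main(String[] args) {
    List<Integer> records = new ArrayList<>();
    for (int i = 0; i < 5; i++) {
      records.add(i);
    }
    int[] calls = {0}; // counts progress() invocations for the demo
    int loaded = loadAll(records.iterator(), () -> calls[0]++, 2);
    System.out.println(loaded + " records, " + calls[0] + " progress calls");
  }
}
```

Reporting every N records rather than on every record is just one way to keep 
the overhead low; the right interval would depend on how long each record takes 
to load relative to the timeout.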

Good luck, and thanks for addressing this. 246 no longer applies cleanly, and I 
have not been able to run any large data for a week now, so this fix will be 
welcome!

                
> Jobs still failing due to tasks timeout during INPUT_SUPERSTEP
> --------------------------------------------------------------
>
>                 Key: GIRAPH-274
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-274
>             Project: Giraph
>          Issue Type: Bug
>    Affects Versions: 0.2.0
>            Reporter: Jaeho Shin
>            Assignee: Jaeho Shin
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-274.patch
>
>
> Even after GIRAPH-267, jobs were failing during INPUT_SUPERSTEP when some 
> workers didn't get to reserve an input split, while others were loading 
> vertices for a long time.  (related to GIRAPH-246 and GIRAPH-267)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
