[
https://issues.apache.org/jira/browse/GIRAPH-274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426782#comment-13426782
]
Eli Reisman commented on GIRAPH-274:
------------------------------------
Thanks for noticing! The patch I had up called progress in more places than
just around the locks. I have been running large amounts of data all summer and
it takes forever to load. I know it polluted the landscape with progress()
calls, but the alternative was another thread as Avery said here and that
seemed like a worse idea AND allowed for zombies to continue when they had
failed for all intents and purposes. When users played with this idea, our
cluster was occasionally littered with zombies that had been forgotten about
by users when the job seemed to fail. So...
The patch I arrived at in 246 worked fine and only hit a 600-second timeout
when the job had actually failed catastrophically at a particular worker. If
you look through it and add the progress calls your lock patch did not, it will
work. I was able to spend up to 60+ min loading huge social graph data with no
trouble, and finishing jobs. Obviously the next step is to lower that time, but
progress() calls are a must. If you grab those calls, I guarantee it will work
for now as long as you need it to. It's been a while, but I'm fairly sure I
didn't give anyone access to context who didn't already have it.
Good luck, and thanks for addressing this. 246 would no longer patch in and I
have not been able to run any large data for a week now, so this fix will be
welcome!
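To illustrate the idea of calling progress() during a long load, here is a
minimal sketch. It is not the code from either patch. The Progressable
interface below is a local stand-in for Hadoop's interface of the same name so
the example compiles on its own, and the class name, the loadWithProgress
method, and the interval parameter are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class ProgressSketch {
    /** Local stand-in (assumption) for Hadoop's Progressable interface. */
    interface Progressable {
        void progress();
    }

    /**
     * Copies every record into sink and reports progress once per
     * `interval` records, plus once at the end, so a slow but healthy
     * load keeps signalling that the task is alive.
     * Returns the number of records loaded.
     */
    static <T> long loadWithProgress(Iterator<T> records, List<T> sink,
                                     Progressable context, int interval) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        long loaded = 0;
        while (records.hasNext()) {
            sink.add(records.next());
            loaded++;
            if (loaded % interval == 0) {
                context.progress(); // periodic report while loading
            }
        }
        context.progress(); // final report once the input is exhausted
        return loaded;
    }

    public static void main(String[] args) {
        List<Integer> input = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            input.add(i);
        }
        int[] calls = {0}; // counts how often progress() was invoked
        List<Integer> sink = new ArrayList<>();
        long loaded = loadWithProgress(input.iterator(), sink,
                                       () -> calls[0]++, 100);
        System.out.println("loaded=" + loaded + " progressCalls=" + calls[0]);
    }
}
```

Because the calls are made from the loading loop itself, a worker that is
truly stuck stops reporting and still times out, which is the property a
separate heartbeat thread would lose.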
> Jobs still failing due to tasks timeout during INPUT_SUPERSTEP
> --------------------------------------------------------------
>
> Key: GIRAPH-274
> URL: https://issues.apache.org/jira/browse/GIRAPH-274
> Project: Giraph
> Issue Type: Bug
> Affects Versions: 0.2.0
> Reporter: Jaeho Shin
> Assignee: Jaeho Shin
> Fix For: 0.2.0
>
> Attachments: GIRAPH-274.patch
>
>
> Even after GIRAPH-267, jobs were failing during INPUT_SUPERSTEP when some
> workers didn't get to reserve an input split, while others were loading
> vertices for a long time. (related to GIRAPH-246 and GIRAPH-267)