[
https://issues.apache.org/jira/browse/GIRAPH-274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431955#comment-13431955
]
Eli Reisman commented on GIRAPH-274:
------------------------------------
This whole thing is so weird and unpleasant. Please realize that whenever I
babble about this issue (now and in the past), it is to pick at seemingly
useless nits to try to figure out why one approach works while the other
doesn't; it doesn't make any sense. With that in mind:
- The versions of Hadoop and the scale we run are different, that could be it.
I can't go into detail on that count here.
- The ONLY two things fundamentally different in the patches, as I recall, are
that your waitMsecs() does not end up calling progress, while waitForever()
does call it after each timeout period before re-looping. Perhaps a call to
progress outside the lock/try block in waitMsecs() would help.
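To illustrate that nit concretely, here is a minimal sketch (not the actual Giraph code; the `Progress` interface below is a stand-in for Hadoop's Progressable / Mapper.Context, and the names are hypothetical) of a waitMsecs() that wakes up in chunks and reports progress outside the synchronized block:

```java
// Hypothetical stand-in for Hadoop's Progressable / Mapper.Context;
// only the progress() call matters for this sketch.
interface Progress {
    void progress();
}

class BarrierWait {
    private final Object lock = new Object();
    private boolean done = false;

    // Wait up to maxMsecs for signal(), waking every chunkMsecs so that
    // progress() is called OUTSIDE the synchronized block on each pass.
    // Returns true if signaled before the deadline, false on timeout.
    boolean waitMsecs(long maxMsecs, long chunkMsecs, Progress reporter)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + maxMsecs;
        while (true) {
            synchronized (lock) {
                if (done) {
                    return true;
                }
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return false;
                }
                lock.wait(Math.min(chunkMsecs, remaining));
            }
            // Outside the lock: report liveness without holding it.
            reporter.progress();
        }
    }

    void signal() {
        synchronized (lock) {
            done = true;
            lock.notifyAll();
        }
    }
}
```

The point of the sketch is only the shape of the loop: the waiter releases the lock before each progress() call, so a contended lock can never delay the liveness report.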
Also, I noticed during the rebase that I lowered the timeout to 30 seconds. At
first this seems silly, since the task timeout is 10 minutes, but until
recently I did not know that Mapper.Context was merely logging most progress
calls and only occasionally going out on the wire to report "real" progress.
This might mean that we need to call it more frequently than is intuitive in
order to get a "real" outgoing call to Hadoop during those long barrier waits.
I will take a better look at the Hadoop end of this ASAP to check the actual
ratio of calls to real signals on the wire, but I suspect this could be part of
the issue. In any case, simply using a shorter timeout of 20-30 seconds seemed
to solve the problem for us here, perhaps because timing out that often meant
we sent a "real" progress call out on the wire just often enough per 10-minute
window.
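The arithmetic behind that hunch can be sketched as follows; the throttling factor K (how many local progress() calls produce one "real" wire report) is an assumption, since the actual ratio on the Hadoop side still needs to be measured:

```java
class ProgressRatio {
    // True if at least one "real" wire report lands inside every
    // task-timeout window, assuming the framework forwards only one
    // in every throttleK local progress() calls.
    static boolean staysAlive(long taskTimeoutSecs, long wakeupSecs,
                              long throttleK) {
        long callsPerWindow = taskTimeoutSecs / wakeupSecs; // e.g. 600/30 = 20
        return callsPerWindow >= throttleK;
    }
}
```

Under this (assumed) model, a 30-second wakeup gives 20 progress calls per 10-minute window and survives any throttling factor up to 20, whereas the original long timeout could leave a whole window with no real report at all.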
Again, this is all grasping at straws. If any of it gives you an idea, please
run with it; I'd like to see this go away, and I'm surprised this issue has
taken on such a life of its own. I'm in a brief window here where no more
large-scale tests are happening, so for the next few days I doubt I will have
the chance to verify any of the alternatives we have all posted here and on
GIRAPH-246, which blocks up progress from my end as well. For all I know one of
these alternatives might really work; as of now only the old patch and the more
recent rebase (274-alt-1, aka 246-7) have been tested and work as they used to.
> Jobs still failing due to tasks timeout during INPUT_SUPERSTEP
> --------------------------------------------------------------
>
> Key: GIRAPH-274
> URL: https://issues.apache.org/jira/browse/GIRAPH-274
> Project: Giraph
> Issue Type: Bug
> Affects Versions: 0.2.0
> Reporter: Jaeho Shin
> Assignee: Jaeho Shin
> Fix For: 0.2.0
>
> Attachments: GIRAPH-274-alt-1.patch, GIRAPH-274.patch
>
>
> Even after GIRAPH-267, jobs were failing during INPUT_SUPERSTEP when some
> workers don't get to reserve an input split, while others were loading
> vertices for a long time. (related to GIRAPH-246 and GIRAPH-267)
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira