[ 
https://issues.apache.org/jira/browse/GIRAPH-274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431955#comment-13431955
 ] 

Eli Reisman commented on GIRAPH-274:
------------------------------------

This whole thing is so weird and unpleasant. Please understand that whenever I 
ramble about this issue (now and in the past), it is to pick at seemingly 
useless nits to try to figure out why one approach works and the other 
doesn't; none of it makes much sense. With that in mind:

- The versions of Hadoop and the scale we run at are different; that could be 
it. I can't go into detail on that count here.

- As I recall, the ONLY fundamental difference between the patches is that 
your waitMsecs() never ends up calling progress, whereas waitForever() does 
after each timeout period, before re-looping. Perhaps adding a call to 
progress in waitMsecs, outside the lock's try block, would help.
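A minimal sketch of that second suggestion, assuming a hypothetical
PredicateLock-style class (the names waitMsecs/signal here are illustrative,
not Giraph's actual code): the wait is broken into slices, and each slice that
times out is followed by a progress callback made outside the lock.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch, not Giraph's actual PredicateLock implementation.
public class ProgressingWait {
    private final Lock lock = new ReentrantLock();
    private final Condition cond = lock.newCondition();
    private volatile boolean eventOccurred = false;

    /**
     * Wait up to msecs for the event, waking every sliceMsecs to run the
     * progress callback OUTSIDE the lock. Returns whether the event occurred.
     */
    public boolean waitMsecs(long msecs, long sliceMsecs, Runnable progress) {
        long deadline = System.currentTimeMillis() + msecs;
        while (true) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                return eventOccurred;
            }
            lock.lock();
            try {
                if (eventOccurred) {
                    return true;
                }
                cond.await(Math.min(remaining, sliceMsecs),
                           TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return eventOccurred;
            } finally {
                lock.unlock();
            }
            // Outside the lock: report liveness (e.g. Mapper.Context
            // progress) so the task is not timed out during long waits.
            progress.run();
        }
    }

    /** Signal the event and wake any waiter. */
    public void signal() {
        lock.lock();
        try {
            eventOccurred = true;
            cond.signalAll();
        } finally {
            lock.unlock();
        }
    }
}
```

The stand-in `progress` Runnable is where a call out to Mapper.Context would
go in the real code.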

Also, I noticed during the rebase that I lowered the timeout to 30 seconds. 
This seems silly at first, since the task timeout is 10 minutes, but until 
recently I did not know that Mapper.Context was merely logging progress calls 
and only occasionally going out on the wire to report "real" progress. This 
might mean that we need to call it more frequently than is intuitive in order 
to get a "real" outgoing call to Hadoop during those long barrier waits. I 
will take a closer look at the Hadoop end of this ASAP to check the actual 
ratio of progress calls to real signals on the wire, but I suspect this could 
be part of the issue. In our case, simply using a shorter timeout of 20-30 
seconds seemed to solve the problem, perhaps because timing out that often 
made us send a "real" progress report over the wire just often enough per 
10-minute window.
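To illustrate the ratio hypothesis, here is a toy model (entirely 
hypothetical, not Hadoop's actual reporter) of a context that forwards only an 
occasional "real" signal to the wire. Under this model, if the wire interval 
were one minute, timing out and calling progress every 30 seconds would yield 
ten real reports per 10-minute window, while calling it only once near the end 
could yield none.

```java
// Toy model of a throttled progress reporter; the throttling interval and
// behavior are assumptions for illustration, not Hadoop's implementation.
public class ThrottledReporter {
    private final long minIntervalMsecs;
    private long lastWireReport = 0;
    private int wireReports = 0;

    public ThrottledReporter(long minIntervalMsecs) {
        this.minIntervalMsecs = minIntervalMsecs;
    }

    /** Called by the worker; forwards to the wire only occasionally. */
    public void progress(long nowMsecs) {
        if (nowMsecs - lastWireReport >= minIntervalMsecs) {
            lastWireReport = nowMsecs;
            wireReports++;  // stand-in for the actual RPC to Hadoop
        }
    }

    public int getWireReports() {
        return wireReports;
    }
}
```

The point of the toy: the caller's timeout controls how often progress is 
*attempted*, but the reporter controls how often anything actually reaches 
Hadoop, so a timeout much longer than the wire interval wastes nothing, while 
a timeout longer than the task-timeout window is fatal.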

Again, this is all grasping at straws. If any of it gives you an idea, please 
run with it. I'd like to see this go away; I'm surprised this issue has taken 
on such a life of its own. I have a brief window here where no more 
large-scale tests are happening, so for the next few days I doubt I will have 
the chance to verify any of the alternatives we have all posted here and on 
GIRAPH-246, which blocks progress from my end as well. For all I know, one of 
these alternatives might really work; as of now, only the old patch and the 
more recent rebase (274-alt-1, aka 246-7) have been tested and work as they 
used to.

                
> Jobs still failing due to tasks timeout during INPUT_SUPERSTEP
> --------------------------------------------------------------
>
>                 Key: GIRAPH-274
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-274
>             Project: Giraph
>          Issue Type: Bug
>    Affects Versions: 0.2.0
>            Reporter: Jaeho Shin
>            Assignee: Jaeho Shin
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-274-alt-1.patch, GIRAPH-274.patch
>
>
> Even after GIRAPH-267, jobs were failing during INPUT_SUPERSTEP when some 
> workers didn't get to reserve an input split, while others were loading 
> vertices for a long time.  (related to GIRAPH-246 and GIRAPH-267)
