[jira] [Commented] (GIRAPH-274) Jobs still failing due to tasks timeout during INPUT_SUPERSTEP

Eli Reisman (JIRA) Sat, 04 Aug 2012 15:34:05 -0700

    [ https://issues.apache.org/jira/browse/GIRAPH-274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428693#comment-13428693 ]


Eli Reisman commented on GIRAPH-274:
------------------------------------

The problem is that we need those threads for Netty IO, ZK client connections,
etc. Remember that multiple mappers (workers) often share the same hardware
node during a job, and on a cluster other MR jobs can also be sharing those
hardware resources with Giraph on many nodes. From what I was told when I was
originally solving this problem, the existing progress calls are cheap. Threads
are at a premium, at least on our clusters, and have more important things to
do.
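
To illustrate what a cheap inline progress call looks like compared with a
dedicated heartbeat thread, here is a minimal sketch. It is not the actual
GIRAPH-246 code. It assumes the loader calls progress() once every N vertices.
Progressable is a local stand-in modeled on Hadoop's
org.apache.hadoop.util.Progressable, and the class name, loadVertices, and
reportEvery are hypothetical names used only for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class InlineProgressSketch {
    /** Local stand-in for Hadoop's Progressable (assumption: only progress() is needed). */
    interface Progressable {
        void progress();
    }

    /**
     * Hypothetical loader loop: reports progress inline on the loading thread,
     * once every reportEvery vertices, instead of using a separate thread.
     * Returns the number of progress calls made.
     */
    static long loadVertices(long vertexCount, long reportEvery, Progressable context) {
        long calls = 0;
        for (long i = 1; i <= vertexCount; i++) {
            // ... read and store one vertex here (omitted in this sketch) ...
            if (i % reportEvery == 0) {
                context.progress(); // tells the framework the task is still alive
                calls++;
            }
        }
        return calls;
    }

    public static void main(String[] args) {
        AtomicInteger heartbeats = new AtomicInteger();
        long calls = loadVertices(1_000_000L, 100_000L, heartbeats::incrementAndGet);
        System.out.println("progress calls: " + calls);
    }
}
```

Run as-is, this prints "progress calls: 10". The per-vertex cost is a single
modulo check, and no extra thread is taken away from Netty or ZK.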

With the 246 patch, I ran well into the hundreds of millions of vertices and
many billions of edges without hitting any more timeouts during healthy jobs.
Have you run Jakob's 232 metrics patch during these timeout jobs? When a job
gets overloaded and workers or their Netty buffers get into trouble due to
memory, the timeouts occur as well. Could memory issues be the reason you're
seeing this, and not a lack of progress calls? Those are the times when the
timeout is actually supposed to happen. Often you'll see some Netty complaints
about needing an exception handler logged to stderr on the mapper detail pages.
Sometimes you get more detail than that, but not always. That means you ran out
of memory and the worker died mid-job, tried to restart/reconnect, and could
not move forward. Anyway, his metrics patch will make it clear as day; worth a
look for sure if it's not stale.

I am trying to run larger jobs right now, and they all still time out when 
running on the current trunk's progress solution. I'm going to put together a 
patch for the current progress problem so I can run large jobs again, and then 
when you guys come up with a better solution, please post it here and I'll 
switch to that.

                
> Jobs still failing due to tasks timeout during INPUT_SUPERSTEP
> --------------------------------------------------------------
>
>                 Key: GIRAPH-274
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-274
>             Project: Giraph
>          Issue Type: Bug
>    Affects Versions: 0.2.0
>            Reporter: Jaeho Shin
>            Assignee: Jaeho Shin
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-274.patch
>
>
> Even after GIRAPH-267, jobs were failing during INPUT_SUPERSTEP when some
> workers did not get to reserve an input split while others were loading
> vertices for a long time.  (related to GIRAPH-246 and GIRAPH-267)

