[ https://issues.apache.org/jira/browse/GIRAPH-274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428693#comment-13428693 ]
Eli Reisman commented on GIRAPH-274:
------------------------------------
The problem is that we need those threads for Netty IO, ZK client connections,
etc. Remember that multiple mappers (workers) often share the same hardware
node during a job, and on a cluster other MR jobs can also be sharing those
hardware resources with Giraph on many nodes. From what I was told when I was
originally solving this problem, the existing progress calls are cheap. Threads
are at a premium, at least on our clusters, and have more important things to
do.
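As a rough sketch of what I mean by cheap inline progress calls (this is not
the actual Giraph code; InlineProgress, ProgressReporter and loadVertices are
made-up names, and ProgressReporter only stands in for Hadoop's progress hook),
the loading thread itself can report progress every N records instead of
handing that job to a dedicated thread:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch only; none of these names come from Giraph. */
final class InlineProgress {

    /** Stand-in for the framework's progress hook (assumed shape). */
    interface ProgressReporter {
        void progress();
    }

    /**
     * Consumes all records on the calling thread and calls progress()
     * once every reportEvery records. Returns the number of records read.
     */
    static int loadVertices(Iterator<String> records, int reportEvery,
                            ProgressReporter reporter) {
        if (reportEvery <= 0) {
            throw new IllegalArgumentException("reportEvery must be positive");
        }
        int loaded = 0;
        while (records.hasNext()) {
            records.next(); // real code would parse and store the vertex here
            loaded++;
            if (loaded % reportEvery == 0) {
                reporter.progress(); // no extra thread needed for this
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        List<String> records =
            Arrays.asList("v1", "v2", "v3", "v4", "v5", "v6", "v7");
        int[] calls = {0};
        int loaded = loadVertices(records.iterator(), 3, () -> calls[0]++);
        System.out.println(loaded + " records loaded, "
            + calls[0] + " progress calls");
    }
}
```

Note this only covers a worker that is actually loading records; a worker
sitting idle without an input split would still need some other way to report
progress.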
With the 246 patch, I ran well into the hundreds of millions of vertices and
many billions of edges without hitting any more trouble with timeouts during
healthy jobs. Have you run Jakob's 232 metrics patch during these timeout
jobs? When a job gets overloaded and workers or their Netty buffers get into
trouble due to memory, the timeouts occur as well. Could memory issues be the
reason you're seeing this, and not a lack of progress calls? Those are times
when the timeout is actually supposed to happen. Often you'll see some Netty
complaints about needing an exception handler logged to stderr in the mapper
detail pages. Sometimes you get more detail than that, but not always. That
means you ran out of memory and the worker died mid-run, tried to
restart/reconnect, and could not move forward. Anyway, his metrics patch will
make it clear as day; worth a look for sure if it's not stale.
I am trying to run larger jobs right now, and they all still time out when
running on the current trunk's progress solution. I'm going to put together a
patch for the current progress problem so I can run large jobs again, and then
when you guys come up with a better solution, please post it here and I'll
switch to that.
> Jobs still failing due to tasks timeout during INPUT_SUPERSTEP
> --------------------------------------------------------------
>
> Key: GIRAPH-274
> URL: https://issues.apache.org/jira/browse/GIRAPH-274
> Project: Giraph
> Issue Type: Bug
> Affects Versions: 0.2.0
> Reporter: Jaeho Shin
> Assignee: Jaeho Shin
> Fix For: 0.2.0
>
> Attachments: GIRAPH-274.patch
>
>
> Even after GIRAPH-267, jobs were failing during INPUT_SUPERSTEP when some
> workers don't get to reserve an input split, while others were loading
> vertices for a long time. (related to GIRAPH-246 and GIRAPH-267)