[ https://issues.apache.org/jira/browse/HADOOP-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12562959#action_12562959 ]
Arun C Murthy commented on HADOOP-2393:
---------------------------------------
bq. after the latest burst of problems - i am reasonably sure that the
underlying cause for these timeouts are expensive synchronized methods in the
taskTracker:
You are right, thanks for the analysis.
I did make TaskTracker.purgeJob a synchronized method via HADOOP-1461 to solve
a corner-case deadlock, which in hindsight might have been the wrong way to fix
it.
I need to check whether my original analysis in HADOOP-1461 is still valid and
whether the alternate fix I've proposed there can be done; watch this space.
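To illustrate the pattern, here is a minimal sketch, with hypothetical class and
method names rather than the actual TaskTracker source: one long-running
synchronized method holds the TaskTracker monitor while it deletes job files, so
every other synchronized entry point, including the one that answers ping(),
blocks until the deletion finishes.

    // Sketch only: hypothetical names, not the actual TaskTracker code.
    import java.io.File;

    public class LockContentionSketch {

      // Stands in for TaskTracker.purgeJob(): holds the object monitor for as
      // long as it takes to recursively delete the job directory.
      public synchronized void purgeJob(File jobDir) {
        deleteRecursively(jobDir);   // slow disk I/O while holding the lock
      }

      // Stands in for the path a handler takes to answer ping(taskId): cheap by
      // itself, but it still has to wait for the monitor held by purgeJob().
      public synchronized boolean ping(String taskId) {
        return true;
      }

      private void deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null) {
          for (File child : children) {
            deleteRecursively(child);
          }
        }
        f.delete();
      }
    }

While a purgeJob() call of that shape is running, the IPC handler threads queue
up behind the monitor, the pending ping() calls age past ipc.client.timeout and
get discarded, and the children eventually exhaust their retries and exit; that
is consistent with the logs in the description below.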
> TaskTracker locks up removing job files within a synchronized method
> ---------------------------------------------------------------------
>
> Key: HADOOP-2393
> URL: https://issues.apache.org/jira/browse/HADOOP-2393
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.14.4
> Environment: 0.13.1, quad-core x86-64, FC Linux, -Xmx2048
> ipc.client.timeout = 10000
> Reporter: Joydeep Sen Sarma
> Priority: Critical
>
> We have some bad jobs where the reduces are getting stalled (for an unknown
> reason). The task tracker kills these processes from time to time. Every time
> one of these events happens, other (healthy) map tasks on the same node are
> also killed. Looking at the logs and the code up to 0.14.3, it seems that the
> child task's pings to the task tracker time out and the child task
> self-terminates.
> tasktracker log:
> // notice the good 10+ second gap in the logs on an otherwise busy node:
> 2007-12-10 09:26:53,047 INFO org.apache.hadoop.mapred.TaskRunner:
> task_0120_r_000001_47 done; removing files.
>
> 2007-12-10 09:27:26,878 INFO org.apache.hadoop.mapred.TaskRunner:
> task_0120_m_000618_0 done; removing files.
>
> 2007-12-10 09:27:26,883 INFO org.apache.hadoop.ipc.Server: Process Thread
> Dump: Discarding call ping(task_0149_m_000007_0) from 10.16.158.113:43941
> 24 active threads
>
> ... huge stack trace dump in logfile ...
> Something was going on at this time that caused the tasktracker to
> essentially stall: all the pings were discarded. After the stack trace dump:
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler
> 0 on 50050, call ping(task_0149_m_000007_0) from 10.16.158.113:43941:\
> discarded for being too old (21380)
>
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler
> 1 on 50050, call ping(task_0149_m_000002_1) from 10.16.158.113:44183:\
> discarded for being too old (21380)
>
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler
> 0 on 50050, call ping(task_0149_m_000007_0) from 10.16.158.113:43941:\
> discarded for being too old (10367)
>
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler
> 1 on 50050, call ping(task_0149_m_000002_1) from 10.16.158.113:44183:\
> discarded for being too old (10360)
>
> 2007-12-10 09:27:26,982 WARN org.apache.hadoop.mapred.TaskRunner:
> task_0149_m_000002_1 Child Error
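>
> (Purely an illustrative sketch with hypothetical names, not the actual
> org.apache.hadoop.ipc.Server code: presumably a handler compares the age of a
> queued call against ipc.client.timeout and drops it instead of answering,
> which would explain the "discarded for being too old" lines above.)
>
>     // Sketch only: hypothetical names, not the real ipc.Server.
>     public class StaleCallCheckSketch {
>       static final long IPC_CLIENT_TIMEOUT_MS = 10000;  // ipc.client.timeout here
>
>       static boolean shouldDiscard(long callReceivedMs, long nowMs) {
>         long ageMs = nowMs - callReceivedMs;             // e.g. 21380 ms above
>         return ageMs > IPC_CLIENT_TIMEOUT_MS;            // true: drop the call
>       }
>     }
>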
> Looking at the code, a failure of the client's ping to the task tracker
> causes termination:
>
>     else {
>       // send ping
>       taskFound = umbilical.ping(taskId);
>     }
>     ...
>     catch (Throwable t) {
>       LOG.info("Communication exception: " + StringUtils.stringifyException(t));
>       remainingRetries -= 1;
>       if (remainingRetries == 0) {
>         ReflectionUtils.logThreadInfo(LOG, "Communication exception", 0);
>         LOG.warn("Last retry, killing " + taskId);
>         System.exit(65);
>       }
>     }
>
> The exit code is 65, as reported by the task tracker.
> I don't see an option to turn off the stack trace dump (which could be a
> likely cause), and I would hate to bump up the timeout because of this. Crap.