[ 
https://issues.apache.org/jira/browse/HADOOP-4517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642531#action_12642531
 ] 

Christian Kunz commented on HADOOP-4517:
----------------------------------------

The thread dump was taken a long time (about 10 hours) after the last log
message containing the above exception for this datanode.

In general, across the whole job I observed an unusually high number of write
errors, including reduce task failures, compared to 0.17.2. In the 12 hours
leading up to the last exception there were 300+ exceptions like the one above
on this datanode alone. I also checked a datanode that did not become dead; it
showed a similar order of magnitude of exceptions.

> unstable dfs when running jobs on 0.18.1
> ----------------------------------------
>
>                 Key: HADOOP-4517
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4517
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.1
>         Environment: hadoop-0.18.1 plus patches HADOOP-4277 HADOOP-4271 
> HADOOP-4326 HADOOP-4314 HADOOP-3914 HADOOP-4318 HADOOP-4351 HADOOP-4395
>            Reporter: Christian Kunz
>         Attachments: datanode.out
>
>
> 2 attempts of a job using 6000 maps and 1900 reduces:
> 1st attempt: failed during the reduce phase after 22 hours with 31 dead
> datanodes, most of which became unresponsive due to an exception; dfs lost
> blocks.
> 2nd attempt: failed during the map phase after 5 hours with 5 dead datanodes
> due to an exception; dfs lost blocks, which caused the job failure.
> I will post a typical datanode exception and attach a thread dump.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.