[ https://issues.apache.org/jira/browse/HADOOP-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587915#action_12587915 ]

Johan Oskarsson commented on HADOOP-3232:
-----------------------------------------

Here's the second sample of iostat -x 30 output from a datanode that just lost
contact (the first sample only reports averages since boot, so the second one
reflects the actual 30-second interval).

{noformat}
avg-cpu:  %user   %nice    %sys %iowait   %idle
          94.26    0.00    0.83    0.00    4.91

Device:    rrqm/s wrqm/s    r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda          0.40   0.20 137.45  5.10 2031.86 2680.44  1015.93  1340.22    33.06     1.93   13.53   6.72  95.73
sdb          0.23   0.03   0.33  1.20  145.02  552.22    72.51   276.11   454.87     0.02   10.22   4.78   0.73
sdc          0.50   0.03   0.70  8.76  307.10 7137.72   153.55  3568.86   786.69     4.70  496.94   6.90   6.53
sdd          0.47   0.07   0.83  0.53  315.63    4.83   157.81     2.42   234.56     0.06   44.39  10.73   1.47
{noformat}

I'm aware that we have quite a lot of small files; it's an issue we're already
working on, and I guess we'll have to ramp up its priority.
Your suggestion that the block reports are causing this makes sense. I'll merge
a few of these directories of small files (see the sketch below) and see if
things improve.
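
For the record, here's a rough sketch of the kind of merge I have in mind:
packing every file in a directory into one SequenceFile keyed by the original
path, so the datanode carries a few large blocks instead of thousands of tiny
ones. The paths are placeholders and the key/value types are just one possible
choice, using the stock FileSystem/SequenceFile APIs, so treat it as an
illustration rather than the exact job we'll run:

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/**
 * Packs every plain file directly under inputDir into a single SequenceFile,
 * keyed by the original path, so the namenode and datanodes track one block
 * set instead of one per small file.
 */
public class SmallFileMerger {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path inputDir = new Path(args[0]); // e.g. /logs/2008-04-11 (placeholder)
    Path outFile = new Path(args[1]);  // e.g. /logs/2008-04-11.seq (placeholder)

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, outFile, Text.class, BytesWritable.class);
    try {
      for (FileStatus stat : fs.listStatus(inputDir)) {
        if (stat.isDir()) {
          continue; // only merge plain files at this level
        }
        // These are small files, so reading each one whole is fine.
        byte[] contents = new byte[(int) stat.getLen()];
        FSDataInputStream in = fs.open(stat.getPath());
        try {
          in.readFully(0, contents);
        } finally {
          in.close();
        }
        writer.append(new Text(stat.getPath().toString()),
                      new BytesWritable(contents));
      }
    } finally {
      writer.close();
    }
  }
}
{code}

Downstream jobs could then read the merged data with SequenceFileInputFormat
instead of opening each small file individually.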

Perhaps the difference between 0.15 and 0.16 is simply that we hit the cluster
quite hard right after the upgrade, since data had queued up.
I'm going to see how it behaves today.

> Datanodes time out
> ------------------
>
>                 Key: HADOOP-3232
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3232
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.16.2
>         Environment: 10 node cluster + 1 namenode
>            Reporter: Johan Oskarsson
>            Priority: Critical
>             Fix For: 0.18.0
>
>         Attachments: hadoop-hadoop-datanode-new.log, 
> hadoop-hadoop-datanode-new.out, hadoop-hadoop-datanode.out, 
> hadoop-hadoop-namenode-master2.out
>
>
> I recently upgraded to 0.16.2 from 0.15.2 on our 10 node cluster.
> Unfortunately we're seeing datanode timeout issues. In previous versions 
> we often saw in the namenode web UI that one or two datanodes' "last contact" 
> value would climb from the usual 0-3 sec to ~200-300 before dropping back to 0.
> This causes mild discomfort, but the big problems appear when all nodes do 
> this at once, as has happened a few times since the upgrade.
> It was suggested that this could be due to namenode garbage collection, but 
> looking at the GC log output that doesn't seem to be the case.
