[ https://issues.apache.org/jira/browse/HADOOP-3132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-3132:
---------------------------------

      Description: 
This problem happens in 0.17 trunk.

As reported in HADOOP-3124,
I saw reducers wait 10 minutes while writing data to DFS and then time out.
The client retried and timed out again after another 19 minutes.

While the write was stuck, all the nodes in the data node pipeline were 
functioning fine.
The system load was normal.
I don't believe this was due to slow network cards/disk drives or overloaded 
machines.
I believe this and HADOOP-3033 are related somehow.


         Priority: Major  (was: Blocker)
    Fix Version/s:     (was: 0.17.0)
                   0.18.0

Making this a non-blocker and moving it to 0.18 because:

# It is not fatal. DFS writes and tasks recover from it.
# It happens very rarely. So far we know of only one cluster where this 
happens.
# It mostly looks like a bug outside Hadoop and the JRE (so it may not be 
present with different kernel versions, hardware, OS, or switches).

Why the delay in diagnosis:
It is hard to reproduce and requires a specific 500-node cluster.

> DFS writes stuck occasionally
> -----------------------------
>
>                 Key: HADOOP-3132
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3132
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Runping Qi
>            Assignee: Raghu Angadi
>             Fix For: 0.18.0
>
>
> This problem happens in 0.17 trunk.
> As reported in HADOOP-3124,
> I saw reducers wait 10 minutes while writing data to DFS and then time out.
> The client retried and timed out again after another 19 minutes.
> While the write was stuck, all the nodes in the data node pipeline 
> were functioning fine.
> The system load was normal.
> I don't believe this was due to slow network cards/disk drives or overloaded 
> machines.
> I believe this and HADOOP-3033 are related somehow.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
