[
https://issues.apache.org/jira/browse/HADOOP-3132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590611#action_12590611
]
Raghu Angadi commented on HADOOP-3132:
--------------------------------------
tcpdump on the sender (the second datanode in the pipeline) shows that the TCP
connection was stuck because of a missing packet. Retransmissions of the
missing packet do not seem to be accepted by the receiver (possibly because of
a wrong checksum; I did not capture traffic on the receiver, will try that next
time).
I captured the last 3-4 minutes of traffic on the sender before the connection
was broken. This explains all the observations:
# the sender has a lot of data in its 'sendbuf'
# the receiver has a lot of data in its 'recvbuf', but the DataNode is blocked
in a read on this socket (see the sketch after this list)
# after 16 minutes or so, the sender's write fails with a 'connect timeout'
exception
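The first two observations line up with a missing segment: TCP must deliver
bytes in order, so the kernel cannot hand anything past the hole to the
application, and the DataNode's read stays blocked even while 'recvbuf' fills
up. Both queues are visible from the shell; a minimal sketch, where the peer
address 10.0.0.2 is a placeholder and 50010 is the default DataNode transfer
port:
{noformat}
# On the sender: a large Send-Q on the pipeline connection matches
# observation 1 (data queued in 'sendbuf' that the peer never ACKs).
netstat -tn | grep '10.0.0.2:50010'

# On the receiver: a large Recv-Q while the DataNode is still blocked
# in read() matches observation 2 -- the out-of-order bytes sit in
# 'recvbuf' but cannot be delivered until the missing segment arrives.
netstat -tn | grep ':50010'
{noformat}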
The missing packet is also confirmed by the fact that every packet from the
remote side carries (TCP option) SACK data of "1448-31332 (relative values)".
This implies the receiver is missing the first 1448 bytes starting at the
acked seqno. There are two retransmissions of this missing packet in the
capture (2 minutes apart). Ethereal says the checksum is incorrect (not sure
how dependable that is, since we do not know whether checksumming is
offloaded, etc.). But in both cases the packet has the same wrong checksum
value, even though it should differ because the TCP headers differ. Traffic on
the receiver would make this clearer.
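For next time, a capture on the receiving side along these lines should settle
whether the retransmissions really arrive with a bad checksum; capturing on
the receiver also avoids the transmit-side checksum-offload artifact, since
incoming packets are checksummed on the wire before the capture point. The
interface name and file name below are placeholders; 50010 is the default
DataNode transfer port:
{noformat}
# Run on the receiving datanode while the write is in progress.
# -s 0 captures full packets so checksums can be verified offline.
tcpdump -i eth0 -s 0 -w /tmp/dn-recv.pcap tcp port 50010
{noformat}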
In any case, this is not an application bug.
> DFS writes stuck occasionally
> -----------------------------
>
> Key: HADOOP-3132
> URL: https://issues.apache.org/jira/browse/HADOOP-3132
> Project: Hadoop Core
> Issue Type: Bug
> Components: dfs
> Reporter: Runping Qi
> Assignee: Raghu Angadi
> Fix For: 0.18.0
>
>
> This problem happens on 0.17 trunk.
> As reported in HADOOP-3124,
> I saw reducers wait 10 minutes writing data to DFS and then time out.
> The client retried and timed out again after another 19 minutes.
> While the write was stuck, all the nodes in the datanode pipeline
> were functioning fine.
> The system load was normal.
> I don't believe this was due to slow network cards/disk drives or overloaded
> machines.
> I believe this and HADOOP-3033 are related somehow.