[
https://issues.apache.org/jira/browse/HDFS-7180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174706#comment-14174706
]
Eric Zhiqiang Ma commented on HDFS-7180:
----------------------------------------
~brandonli: Not at all, and many thanks for the analysis and confirmation!
I checked the log on 10.0.3.176 and found the socket timeout exception
between 10.0.3.172 and 10.0.3.176 shown below.
------------------------------
2014-10-02 06:00:07,326 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643 src: /10.0.3.172:37334 dest: /10.0.3.176:50010
2014-10-02 06:00:31,970 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow flushOrSync took 24097ms (threshold=300ms), isSync:true, flushTotalNanos=9424ns
2014-10-02 06:01:32,093 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.3.176:50010 remote=/10.0.3.172:37334]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:453)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:734)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:741)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
        at java.lang.Thread.run(Thread.java:745)
2014-10-02 06:01:32,093 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643, type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
2014-10-02 06:01:32,093 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2014-10-02 06:01:32,093 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643 received exception java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.3.176:50010 remote=/10.0.3.172:37334]
2014-10-02 06:01:32,093 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: dstore-176:50010:DataXceiver error processing WRITE_BLOCK operation src: /10.0.3.172:37334 dst: /10.0.3.176:50010
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.3.176:50010 remote=/10.0.3.172:37334]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:453)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:734)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:741)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
        at java.lang.Thread.run(Thread.java:745)
------------------------------
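In case it is relevant: if I read the exception correctly, the 60000 millis is
the DataNode-side socket read timeout, which I believe comes from
dfs.client.socket-timeout (default 60 s in hdfs-default.xml). I doubt raising
it would fix the root cause (the write stalls first), but for reference a
hdfs-site.xml change like the following should lengthen it (the 180000 value
is only an example):
------------------------------
<property>
  <name>dfs.client.socket-timeout</name>
  <!-- Example value only: raise the default 60 s read timeout to 3 minutes. -->
  <value>180000</value>
</property>
------------------------------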
Our log level is INFO, so I do not have any information about
nonSequentialWriteInMemory.
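If it would help, I can raise the log level for the NFS3 classes and try to
reproduce. I have not checked at which level nonSequentialWriteInMemory is
actually logged, but assuming it is DEBUG, something like this in
log4j.properties should expose it:
------------------------------
# Assumption: the NFS3 write path (OpenFileCtx and friends) logs the
# nonSequentialWriteInMemory counter at DEBUG level.
log4j.logger.org.apache.hadoop.hdfs.nfs=DEBUG
log4j.logger.org.apache.hadoop.nfs=DEBUG
------------------------------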
BTW: is it possible to disable reordered writes, or at least reduce their
rate, through configuration of the NFSv3 gateway or the NFS client? One thing
I can try on our side is the remount sketched below.
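I believe the HDFS NFS gateway documentation recommends the sync mount option
precisely to minimize reordered writes, so I will test a remount like the
following (the mount point /mnt/hdfs is just a placeholder for ours; the other
options are what we already use):
------------------------------
# Remount the gateway export with "sync", which per the HDFS NFS gateway
# docs should minimize or avoid reordered writes from the NFS client.
mount -t nfs -o vers=3,proto=tcp,nolock,sync localhost:/ /mnt/hdfs
------------------------------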
> NFSv3 gateway frequently gets stuck
> -----------------------------------
>
> Key: HDFS-7180
> URL: https://issues.apache.org/jira/browse/HDFS-7180
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: nfs
> Affects Versions: 2.5.0
> Environment: Linux, Fedora 19 x86-64
> Reporter: Eric Zhiqiang Ma
> Assignee: Brandon Li
> Priority: Critical
>
> We are using Hadoop 2.5.0 (HDFS only) and start and mount the NFSv3 gateway
> on one node in the cluster to let users upload data with rsync.
> However, we find that the NFSv3 daemon frequently gets stuck while HDFS
> itself keeps working well (`hdfs dfs -ls` etc. work just fine). The latest
> hang we observed happened after around 1 day of running and several hundred
> GBs of data uploaded.
> The NFSv3 daemon is started on one node, and the NFS export is mounted on
> the same node.
> From the node where the NFS is mounted, `dmesg` shows lines like this:
> [1859245.368108] nfs: server localhost not responding, still trying
> [1859245.368111] nfs: server localhost not responding, still trying
> [1859245.368115] nfs: server localhost not responding, still trying
> [1859245.368119] nfs: server localhost not responding, still trying
> [1859245.368123] nfs: server localhost not responding, still trying
> [1859245.368127] nfs: server localhost not responding, still trying
> [1859245.368131] nfs: server localhost not responding, still trying
> [1859245.368135] nfs: server localhost not responding, still trying
> [1859245.368138] nfs: server localhost not responding, still trying
> [1859245.368142] nfs: server localhost not responding, still trying
> [1859245.368146] nfs: server localhost not responding, still trying
> [1859245.368150] nfs: server localhost not responding, still trying
> [1859245.368153] nfs: server localhost not responding, still trying
> The mounted directory cannot be listed with `ls`, and `df -hT` gets stuck
> too.
> The latest lines from the nfs3 log in the hadoop logs directory:
> 2014-10-02 05:43:20,452 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated user map size: 35
> 2014-10-02 05:43:20,461 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated group map size: 54
> 2014-10-02 05:44:40,374 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:44:40,732 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:46:06,535 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:46:26,075 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:47:56,420 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:48:56,477 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:51:46,750 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:53:23,809 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:53:24,508 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:55:57,334 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:57:07,428 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:58:32,609 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Update cache now
> 2014-10-02 05:58:32,610 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Not doing static UID/GID mapping because '/etc/nfs.map' does not exist.
> 2014-10-02 05:58:32,620 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated user map size: 35
> 2014-10-02 05:58:32,628 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated group map size: 54
> 2014-10-02 06:01:32,098 WARN org.apache.hadoop.hdfs.DFSClient: Slow ReadProcessor read fields took 60062ms (threshold=30000ms); ack: seqno: -2 status: SUCCESS status: ERROR downstreamAckTimeNanos: 0, targets: [10.0.3.172:50010, 10.0.3.176:50010]
> 2014-10-02 06:01:32,099 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643
> java.io.IOException: Bad response ERROR for block BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643 from datanode 10.0.3.176:50010
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:828)
> 2014-10-02 06:07:00,368 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643 in pipeline 10.0.3.172:50010, 10.0.3.176:50010: bad datanode 10.0.3.176:50010
> The logs seem to suggest that 10.0.3.176 is bad. However, `hdfs dfsadmin
> -report` shows all nodes in the cluster as working.
> Any help will be appreciated. Thanks in advance.