[
https://issues.apache.org/jira/browse/HDFS-7180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174706#comment-14174706
]
Eric Zhiqiang Ma commented on HDFS-7180:
----------------------------------------
~brandonli: Not at all, and many thanks for the analysis and confirmation!
I checked the log on 10.0.3.176 and found the socket timeout exception
between 10.0.3.172 and 10.0.3.176 shown below.
------------------------------
2014-10-02 06:00:07,326 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643 src: /10.0.3.172:37334 dest: /10.0.3.176:50010
2014-10-02 06:00:31,970 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow flushOrSync took 24097ms (threshold=300ms), isSync:true, flushTotalNanos=9424ns
2014-10-02 06:01:32,093 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.3.176:50010 remote=/10.0.3.172:37334]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:453)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:734)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:741)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
        at java.lang.Thread.run(Thread.java:745)
2014-10-02 06:01:32,093 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643, type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
2014-10-02 06:01:32,093 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2014-10-02 06:01:32,093 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643 received exception java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.3.176:50010 remote=/10.0.3.172:37334]
2014-10-02 06:01:32,093 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: dstore-176:50010:DataXceiver error processing WRITE_BLOCK operation src: /10.0.3.172:37334 dst: /10.0.3.176:50010
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.3.176:50010 remote=/10.0.3.172:37334]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:453)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:734)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:741)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
        at java.lang.Thread.run(Thread.java:745)
------------------------------
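In case it is relevant: if I read the exception correctly, the 60000 millis is
the DataNode-side socket read timeout, which I believe comes from
dfs.client.socket-timeout (default 60 s in hdfs-default.xml). I doubt raising
it would fix the root cause (the write stalls first), but for reference a
hdfs-site.xml change like the following should lengthen it (the 180000 value
is only an example):
------------------------------
<property>
  <name>dfs.client.socket-timeout</name>
  <!-- Example value only: raise the default 60 s read timeout to 3 minutes. -->
  <value>180000</value>
</property>
------------------------------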
Our log level is INFO, so I do not have any information about
nonSequentialWriteInMemory.
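If it would help, I can raise the log level for the NFS3 classes and try to
reproduce. I have not checked at which level nonSequentialWriteInMemory is
actually logged, but assuming it is DEBUG, something like this in
log4j.properties should expose it:
------------------------------
# Assumption: the NFS3 write path (OpenFileCtx and friends) logs the
# nonSequentialWriteInMemory counter at DEBUG level.
log4j.logger.org.apache.hadoop.hdfs.nfs=DEBUG
log4j.logger.org.apache.hadoop.nfs=DEBUG
------------------------------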
BTW: is it possible to disable reordered writes, or at least reduce their
rate, through configuration of the NFSv3 gateway or the NFS client? One thing
I can try on our side is the remount sketched below.
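I believe the HDFS NFS gateway documentation recommends the sync mount option
precisely to minimize reordered writes, so I will test a remount like the
following (the mount point /mnt/hdfs is just a placeholder for ours; the other
options are what we already use):
------------------------------
# Remount the gateway export with "sync", which per the HDFS NFS gateway
# docs should minimize or avoid reordered writes from the NFS client.
mount -t nfs -o vers=3,proto=tcp,nolock,sync localhost:/ /mnt/hdfs
------------------------------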
> NFSv3 gateway frequently gets stuck
> -----------------------------------
>
> Key: HDFS-7180
> URL: https://issues.apache.org/jira/browse/HDFS-7180
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: nfs
> Affects Versions: 2.5.0
> Environment: Linux, Fedora 19 x86-64
> Reporter: Eric Zhiqiang Ma
> Assignee: Brandon Li
> Priority: Critical
>
> We are using Hadoop 2.5.0 (HDFS only) and start and mount the NFSv3 gateway
> on one node in the cluster to let users upload data with rsync.
> However, we find that the NFSv3 daemon frequently gets stuck while HDFS
> itself keeps working well (`hdfs dfs -ls` etc. work just fine). The latest
> hang we observed happened after around 1 day of running and several hundred
> GBs of data uploaded.
> The NFSv3 daemon is started on one node, and the NFS export is mounted on
> the same node.
> From the node where the NFS is mounted, `dmesg` shows lines like this:
> [1859245.368108] nfs: server localhost not responding, still trying
> [1859245.368111] nfs: server localhost not responding, still trying
> [1859245.368115] nfs: server localhost not responding, still trying
> [1859245.368119] nfs: server localhost not responding, still trying
> [1859245.368123] nfs: server localhost not responding, still trying
> [1859245.368127] nfs: server localhost not responding, still trying
> [1859245.368131] nfs: server localhost not responding, still trying
> [1859245.368135] nfs: server localhost not responding, still trying
> [1859245.368138] nfs: server localhost not responding, still trying
> [1859245.368142] nfs: server localhost not responding, still trying
> [1859245.368146] nfs: server localhost not responding, still trying
> [1859245.368150] nfs: server localhost not responding, still trying
> [1859245.368153] nfs: server localhost not responding, still trying
> The mounted directory cannot be listed with `ls`, and `df -hT` gets stuck
> too.
> The latest lines from the nfs3 log in the hadoop logs directory:
> 2014-10-02 05:43:20,452 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated user map size: 35
> 2014-10-02 05:43:20,461 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated group map size: 54
> 2014-10-02 05:44:40,374 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:44:40,732 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:46:06,535 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:46:26,075 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:47:56,420 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:48:56,477 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:51:46,750 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:53:23,809 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:53:24,508 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:55:57,334 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:57:07,428 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:58:32,609 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Update cache now
> 2014-10-02 05:58:32,610 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Not doing static UID/GID mapping because '/etc/nfs.map' does not exist.
> 2014-10-02 05:58:32,620 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated user map size: 35
> 2014-10-02 05:58:32,628 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated group map size: 54
> 2014-10-02 06:01:32,098 WARN org.apache.hadoop.hdfs.DFSClient: Slow ReadProcessor read fields took 60062ms (threshold=30000ms); ack: seqno: -2 status: SUCCESS status: ERROR downstreamAckTimeNanos: 0, targets: [10.0.3.172:50010, 10.0.3.176:50010]
> 2014-10-02 06:01:32,099 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643
> java.io.IOException: Bad response ERROR for block BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643 from datanode 10.0.3.176:50010
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:828)
> 2014-10-02 06:07:00,368 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643 in pipeline 10.0.3.172:50010, 10.0.3.176:50010: bad datanode 10.0.3.176:50010
> The logs seem to suggest that 10.0.3.176 is bad. However, `hdfs dfsadmin
> -report` shows all nodes in the cluster as working.
> Any help will be appreciated. Thanks in advance.