[ https://issues.apache.org/jira/browse/HDFS-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069874#comment-16069874 ]

fanqiushi commented on HDFS-7915:
---------------------------------

Hi Colin P. McCabe, I see a similar error on Hadoop 2.5.2 and wonder whether it 
is the same issue as this JIRA:
When I run HBase on Hadoop, reads and writes on HDFS sometimes become very 
slow, and errors appear in both the HBase log and the DataNode log.
HBase log:
2017-06-28 04:59:16,206 WARN org.apache.hadoop.hdfs.BlockReaderFactory: 
BlockReaderFactory(fileName=/hyperbase1/data/default/DW_TB_PROTOCOL_201726/693313b31f0db7ffaa7b818a82f75f05/d/baa8cfbf62ac49578829a90c1f8b9603,
 block=BP-673187784-166.0.8.10-1489391820899:blk_1265649740_191912424): I/O 
error requesting file descriptors.  Disabling domain socket 
DomainSocket(fd=3830,path=/var/run/hdfs1/dn_socket)
java.net.SocketTimeoutException: read(2) error: Resource temporarily unavailable
        at org.apache.hadoop.net.unix.DomainSocket.readArray0(Native Method)
        at 
org.apache.hadoop.net.unix.DomainSocket.access$000(DomainSocket.java:45)
        at 
org.apache.hadoop.net.unix.DomainSocket$DomainInputStream.read(DomainSocket.java:532)
        at 
org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1998)
        at 
org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.requestNewShm(DfsClientShmManager.java:169)
        at 
org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:261)
        at 
org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:432)
        at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1015)
        at 
org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:450)
        at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:782)
        at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:716)
        at 
org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:396)
        at 
org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:304)
        at 
org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:609)
        at 
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:832)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:881)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at 
org.apache.hadoop.hbase.io.hfile.HFileBlock.readWithExtra(HFileBlock.java:563)
        at 
org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1215)
        at 
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockDataInternal(HFileBlock.java:1430)
        at 
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockData(HFileBlock.java:1312)
        at 
org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:387)
        at 
org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.readNextDataBlock(HFileReaderV2.java:642)
        at 
org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1102)
        at 
org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:273)
        at 
org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:173)
        at 
org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55)
        at 
org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:313)
        at 
org.apache.hadoop.hbase.regionserver.KeyValueHeap.reseek(KeyValueHeap.java:257)
        at 
org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:697)
        at 
org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:683)
        at 
org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:533)
        at 
org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:223)
        at 
org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:76)
        at 
org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:109)
        at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1131)
        at 
org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1528)
        at 
org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:494)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2017-06-28 04:59:16,207 WARN 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache: 
ShortCircuitCache(0x20a5dc77): failed to load 
1265649740_BP-673187784-166.0.8.10-1489391820899
2017-06-28 04:59:16,208 DEBUG 
org.apache.hadoop.hbase.regionserver.compactions.Compactor: Compaction 
progress: 129941931/6104366 (2128.67%), rate=2293.77 kB/sec
2017-06-28 04:59:22,875 INFO org.apache.hadoop.hbase.util.JvmPauseMonitor: 
Detected pause in JVM or host machine (eg GC): pause of approximately 1680ms
GC pool 'ParNew' had collection(s): count=1 time=1846ms
2017-06-28 04:59:31,969 ERROR 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache: 
ShortCircuitCache(0x20a5dc77): failed to release short-circuit shared memory 
slot Slot(slotIdx=56, shm=DfsClientShm(36b4864aeb6f612be0f3420acc48c2af)) by 
sending ReleaseShortCircuitAccessRequestProto to /var/run/hdfs1/dn_socket.  
Closing shared memory segment.
java.io.IOException: ERROR_INVALID: there is no shared memory segment 
registered with shmId 36b4864aeb6f612be0f3420acc48c2af
        at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache$SlotReleaser.run(ShortCircuitCache.java:208)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2017-06-28 04:59:31,969 WARN 
org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager: 
EndpointShmManager(166.0.8.33:50010, parent=ShortCircuitShmManager(0993730b)): 
error shutting down shm: got IOException calling shutdown(SHUT_RDWR)
java.nio.channels.ClosedChannelException
        at 
org.apache.hadoop.util.CloseableReferenceCount.reference(CloseableReferenceCount.java:57)
        at 
org.apache.hadoop.net.unix.DomainSocket.shutdown(DomainSocket.java:387)
        at 
org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.shutdown(DfsClientShmManager.java:378)
        at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache$SlotReleaser.run(ShortCircuitCache.java:223)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2017-06-28 04:59:32,368 ERROR 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache: 
ShortCircuitCache(0x20a5dc77): failed to release short-circuit shared memory 
slot Slot(slotIdx=27, shm=DfsClientShm(c59bc3c02472ae764b297761a88984c7)) by 
sending ReleaseShortCircuitAccessRequestProto to /var/run/hdfs1/dn_socket.  
Closing shared memory segment.
java.io.IOException: ERROR_INVALID: there is no shared memory segment 
registered with shmId c59bc3c02472ae764b297761a88984c7
        at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache$SlotReleaser.run(ShortCircuitCache.java:208)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2017-06-28 04:59:32,368 WARN 
org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager: 
EndpointShmManager(166.0.8.33:50010, parent=ShortCircuitShmManager(0993730b)): 
error shutting down shm: got IOException calling shutdown(SHUT_RDWR)
java.nio.channels.ClosedChannelException
        at 
org.apache.hadoop.util.CloseableReferenceCount.reference(CloseableReferenceCount.java:57)
        at 
org.apache.hadoop.net.unix.DomainSocket.shutdown(DomainSocket.java:387)
        at 
org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.shutdown(DfsClientShmManager.java:378)
        at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache$SlotReleaser.run(ShortCircuitCache.java:223)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2017-06-28 04:59:36,948 INFO org.mortbay.log: 
org.mortbay.io.nio.SelectorManager$SelectSet@21941784 JVM BUG(s) - injecting 
delay2 times
2017-06-28 04:59:36,948 INFO org.mortbay.log: 
org.mortbay.io.nio.SelectorManager$SelectSet@21941784 JVM BUG(s) - recreating 
selector 2 times, canceled keys 52 times


DataNode log:
2017-06-28 04:48:27,157 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Exception for BP-673187784-166.0.8.10-1489391820899:blk_1266878404_193141813
java.net.SocketTimeoutException: 120000 millis timeout while waiting for 
channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/166.0.8.33:50010 remote=/166.0.8.33:52712]
        at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:453)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:734)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:741)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
        at java.lang.Thread.run(Thread.java:745)
2017-06-28 04:48:27,158 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
PacketResponder: 
BP-673187784-166.0.8.10-1489391820899:blk_1266878404_193141813, 
type=HAS_DOWNSTREAM_IN_PIPELINE
java.io.EOFException: Premature EOF: no length prefix available
        at 
org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2000)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1115)
        at java.lang.Thread.run(Thread.java:745)
2017-06-28 04:48:27,208 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Exception for BP-673187784-166.0.8.10-1489391820899:blk_1266878380_193141789

I have two questions:
(1) How did this error happen?
(2) Can the patch on this JIRA solve this problem?

Looking forward to your reply.

> The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell 
> the DFSClient about it because of a network error
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7915
>                 URL: https://issues.apache.org/jira/browse/HDFS-7915
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Colin P. McCabe
>            Assignee: Colin P. McCabe
>              Labels: 2.6.1-candidate
>             Fix For: 2.7.0, 2.6.1, 3.0.0-alpha1
>
>         Attachments: HDFS-7915.001.patch, HDFS-7915.002.patch, 
> HDFS-7915.004.patch, HDFS-7915.005.patch, HDFS-7915.006.patch, 
> HDFS-7915.branch-2.6.patch
>
>
> The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell 
> the DFSClient about it because of a network error.  In 
> {{DataXceiver#requestShortCircuitFds}}, the DataNode can succeed at the first 
> part (mark the slot as used) and fail at the second part (tell the DFSClient 
> what it did). The "try" block for unregistering the slot only covers a 
> failure in the first part, not the second part. In this way, a divergence can 
> form between the views of which slots are allocated on DFSClient and on 
> server.
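The failure mode described above is essentially a cleanup-scoping problem: the slot release only covers the allocation step, not the reply to the client. The sketch below models that shape in isolation. All names here (SlotManager, Slot, sendResponse) are hypothetical stand-ins for illustration, not the actual Hadoop classes; the conceptual fix is to extend the cleanup so any failure after allocation, including a network error while replying, releases the slot again.

```java
import java.io.IOException;

public class SlotLeakSketch {
    // Hypothetical stand-ins for the DataNode's shared-memory slot bookkeeping.
    static class Slot { final int idx; Slot(int idx) { this.idx = idx; } }

    static class SlotManager {
        int allocated = 0;
        Slot alloc() { allocated++; return new Slot(allocated); }
        void release(Slot s) { allocated--; }
    }

    // Buggy shape: nothing releases the slot if the reply to the client fails,
    // so the server still counts the slot as used while the client never saw it.
    static void requestFdsBuggy(SlotManager mgr, boolean replyFails) throws IOException {
        Slot slot = mgr.alloc();      // part 1: mark the slot as used
        sendResponse(replyFails);     // part 2: tell the DFSClient -- uncovered!
    }

    // Fixed shape (conceptually what the patch does): release the slot on any
    // failure after allocation, keeping client and server views consistent.
    static void requestFdsFixed(SlotManager mgr, boolean replyFails) throws IOException {
        Slot slot = mgr.alloc();
        boolean success = false;
        try {
            sendResponse(replyFails);
            success = true;
        } finally {
            if (!success) {
                mgr.release(slot);    // undo part 1 when part 2 fails
            }
        }
    }

    // Simulates the network reply to the DFSClient.
    static void sendResponse(boolean fail) throws IOException {
        if (fail) throw new IOException("simulated network error");
    }

    public static void main(String[] args) {
        SlotManager buggy = new SlotManager();
        try { requestFdsBuggy(buggy, true); } catch (IOException ignored) { }
        System.out.println("buggy leaked slots: " + buggy.allocated);

        SlotManager fixed = new SlotManager();
        try { requestFdsFixed(fixed, true); } catch (IOException ignored) { }
        System.out.println("fixed leaked slots: " + fixed.allocated);
    }
}
```

With a simulated reply failure, the buggy version leaves one slot allocated while the fixed version leaves none, which is the "divergence between views" the description refers to.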



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
