[ https://issues.apache.org/jira/browse/HBASE-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658516#action_12658516 ]

Thibaut commented on HBASE-1078:
--------------------------------

The datanode on at least one node is logging a lot of DataXceiver exceptions.
(I had removed the Hadoop config entry for this; I will set
dfs.datanode.max.xcievers to 2048 and retry.)
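
For reference, this is roughly the hdfs-site.xml entry I mean (a sketch only;
2048 is simply the value I intend to try, and the property name really is the
historically misspelled "xcievers"):

  <!-- Raise the per-datanode cap on concurrent DataXceiver threads;
       the stock default is much lower and HBase can exhaust it under load. -->
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>2048</value>
  </property>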

08/12/22 12:33:07 INFO datanode.DataNode: writeBlock 
blk_2596678729303248869_20816 received exception java.io.EOFException: while 
trying to read 65557 bytes
08/12/22 12:33:07 ERROR datanode.DataNode: 
DatanodeRegistration(192.168.0.7:50010, 
storageID=DS-1147686120-78.46.82.211-50010-1229624756268, infoPort=50075, 
ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 65557 bytes
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:254)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:298)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:362)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
        at java.lang.Thread.run(Thread.java:619)
08/12/22 12:33:07 INFO datanode.DataNode: PacketResponder 
blk_-3816193612223297244_20739 1 Exception java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
        at java.io.DataOutputStream.writeLong(DataOutputStream.java:207)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:836)
        at java.lang.Thread.run(Thread.java:619)

08/12/22 12:33:07 INFO datanode.DataNode: PacketResponder 
blk_-3816193612223297244_20739 1 : Thread is interrupted.
08/12/22 12:33:07 INFO datanode.DataNode: PacketResponder 1 for block 
blk_-3816193612223297244_20739 terminating
08/12/22 12:33:07 INFO datanode.DataNode: writeBlock 
blk_-3816193612223297244_20739 received exception java.io.EOFException: while 
trying to read 65557 bytes
08/12/22 12:33:07 ERROR datanode.DataNode: 
DatanodeRegistration(192.168.0.7:50010, 
storageID=DS-1147686120-78.46.82.211-50010-1229624756268, infoPort=50075, 
ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 65557 bytes
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:254)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:298)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:362)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
        at java.lang.Thread.run(Thread.java:619)
08/12/22 12:33:10 INFO datanode.DataNode: PacketResponder 
blk_5759147435898729286_20763 1 Exception java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
        at java.io.DataOutputStream.writeLong(DataOutputStream.java:207)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:836)
        at java.lang.Thread.run(Thread.java:619)

08/12/22 12:33:10 INFO datanode.DataNode: PacketResponder 
blk_5759147435898729286_20763 1 : Thread is interrupted.
08/12/22 12:33:10 INFO datanode.DataNode: PacketResponder 1 for block 
blk_5759147435898729286_20763 terminating
08/12/22 12:33:10 INFO datanode.DataNode: writeBlock 
blk_5759147435898729286_20763 received exception java.io.EOFException: while 
trying to read 65557 bytes
08/12/22 12:33:10 ERROR datanode.DataNode: 
DatanodeRegistration(192.168.0.7:50010, 
storageID=DS-1147686120-78.46.82.211-50010-1229624756268, infoPort=50075, 
ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 65557 bytes
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:254)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:298)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:362)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
        at java.lang.Thread.run(Thread.java:619)

These exceptions might be what is causing the HBase failure.


When I try to shut down the cluster, two region servers hang.
Here is the thread dump of one of them (the other one is identical):

2008-12-22 12:38:16
Full thread dump Java HotSpot(TM) 64-Bit Server VM (10.0-b23 mixed mode):

"Attach Listener" daemon prio=10 tid=0x00002aab354cac00 nid=0x7af7 waiting on 
condition [0x0000000000000000..0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"pool-1-thread-1" prio=10 tid=0x00002aab35a2b400 nid=0x22b0 waiting on 
condition [0x0000000042a52000..0x0000000042a52aa0]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00002aaab554efc0> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
        at java.util.concurrent.DelayQueue.take(DelayQueue.java:160)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:582)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:575)
        at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:946)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:906)
        at java.lang.Thread.run(Thread.java:619)

"DestroyJavaVM" prio=10 tid=0x00002aab381c8000 nid=0x2275 waiting on condition 
[0x0000000000000000..0x000000004022ad10]
   java.lang.Thread.State: RUNNABLE

"Low Memory Detector" daemon prio=10 tid=0x00002aab3483c000 nid=0x227e runnable 
[0x0000000000000000..0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread1" daemon prio=10 tid=0x00002aab34839800 nid=0x227d waiting on 
condition [0x0000000000000000..0x0000000040a31440]
   java.lang.Thread.State: RUNNABLE

"CompilerThread0" daemon prio=10 tid=0x00002aab34836000 nid=0x227c waiting on 
condition [0x0000000000000000..0x00000000409303c0]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00002aab34834c00 nid=0x227b runnable 
[0x0000000000000000..0x0000000040830730]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00002aab34628400 nid=0x227a in Object.wait() 
[0x000000004072f000..0x000000004072fa20]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
        - locked <0x00002aaab535e3f8> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x00002aab34627000 nid=0x2279 in 
Object.wait() [0x000000004062e000..0x000000004062eda0]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:485)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
        - locked <0x00002aaab5290178> (a java.lang.ref.Reference$Lock)

"VM Thread" prio=10 tid=0x00002aab34621c00 nid=0x2278 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x000000004011e000 nid=0x2276 
runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x000000004011f400 nid=0x2277 
runnable

"VM Periodic Task Thread" prio=10 tid=0x00002aab3483dc00 nid=0x227f waiting on 
condition

JNI global references: 1001
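
(A dump like the one above can be taken against the hanging process with
standard JDK tooling, e.g. jstack <regionserver-pid>, or kill -QUIT
<regionserver-pid>, which writes the dump to the region server's .out file;
the pid placeholder is illustrative.)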



There is nothing suspicious in the region server's log file;
it just ends with:
2008-12-22 12:26:16,117 INFO  [regionserver/0:0:0:0:0:0:0:0:60020] 
regionserver.HRegionServer(508): telling master that region server is shutting 
down at: 192.168.0.7:60020
2008-12-22 12:26:16,122 INFO  [regionserver/0:0:0:0:0:0:0:0:60020] 
regionserver.HRegionServer(515): stopping server at: 192.168.0.7:60020
2008-12-22 12:26:16,545 INFO  [regionserver/0:0:0:0:0:0:0:0:60020] 
regionserver.HRegionServer(525): regionserver/0:0:0:0:0:0:0:0:60020 exiting


> "java.io.IOException: Could not obtain block": allthough file is there and 
> accessible through the dfs client
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1078
>                 URL: https://issues.apache.org/jira/browse/HBASE-1078
>             Project: Hadoop HBase
>          Issue Type: Bug
>         Environment: hadoop 0.19.0
> hbase 0.19.0-dev, r728134
>            Reporter: Thibaut
>         Attachments: errorlogs.zip
>
>
> Hi,
> after doing some more stress testing, my cluster just stopped working. 
> The region server responsible for the ROOT region can't read a block belonging 
> to the ROOT region, but the block is definitely there, as I can read the file 
> through the dfs client.
> All new clients fail to start:
> java.io.IOException: java.io.IOException: Could not obtain block: 
> blk_-3504243288385983835_18732 
> file=/hbase/-ROOT-/70236052/info/mapfiles/780254459775584115/data
>         at 
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1708)
>         at 
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1536)
>         at 
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1663)
>         at 
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1593)
>         at java.io.DataInputStream.readInt(DataInputStream.java:370)
>         at 
> org.apache.hadoop.io.SequenceFile$Reader.readRecordLength(SequenceFile.java:1895)
>         at 
> org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1925)
>         at 
> org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
>         at 
> org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
>         at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:517)
>         at 
> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromMapFile(HStore.java:1709)
>         at 
> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1681)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1072)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getClosestRowBefore(HRegionServer.java:1466)
>         at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:632)
>         at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:894)
>         at sun.reflect.GeneratedConstructorAccessor13.newInstance(Unknown 
> Source)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>         at 
> org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:95)
>         at 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:550)
>         at 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:450)
>         at 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:422)
>         at 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:559)
>         at 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:454)
>         at 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:415)
>         at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:113)
>         at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:96)
> Clients that are still connected continue to work (presumably because they 
> have cached the ROOT region location?).
> This seems to have happened after one of the region servers (number 3) shut 
> itself down due to exceptions (EOFException, "Unable to create new block", 
> etc.; see the log file). The ROOT region then probably got moved to region 
> server 2.
> I attached the logs (DEBUG enabled) of the HDFS namenode, the HBase master 
> node, and the log files of region servers 2 and 3.
> The filesystem is in a healthy state. I can also download the file with the 
> hadoop fs command without any problem and without getting an error message 
> about missing blocks.
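> (Concretely, the checks were along these lines; the local destination path is 
> only an example:)
>
>   hadoop fs -get /hbase/-ROOT-/70236052/info/mapfiles/780254459775584115/data /tmp/data-copy
>   hadoop fsck /
>
> The fsck summary: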
> Status: HEALTHY
>  Total size:    142881532319 B (Total open files size: 12415139840 B)
>  Total dirs:    4153
>  Total files:   3541 (Files currently being written: 106)
>  Total blocks (validated):      5208 (avg. block size 27435010 B) (Total open 
> file blocks (not validated): 205)
>  Minimally replicated blocks:   5208 (100.0 %)
>  Over-replicated blocks:        0 (0.0 %)
>  Under-replicated blocks:       0 (0.0 %)
>  Mis-replicated blocks:         0 (0.0 %)
>  Default replication factor:    4
>  Average block replication:     4.0
>  Corrupt blocks:                0
>  Missing replicas:              0 (0.0 %)
>  Number of data-nodes:          7
>  Number of racks:               1
> The filesystem under path '/' is HEALTHY

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
