I like the way you were able to dig down into multiple logs and present the
information to us, but this looks more like GC than an HDFS failure. In your
region server log, go back to the first FATAL and see whether it got a session
expired from ZK, along with other messages about a client not being able to
talk to a server for some amount of time. If that's the case, then what you are
seeing is the result of IO fencing by the master.
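Something along these lines should surface it quickly (just a sketch against
the log file name from your paste; the exact wording of the ZK session-expired
and GC-pause messages varies a bit between versions):

grep -n "FATAL" regionserver.2013-10-02 | head -1
grep -in "session expired" regionserver.2013-10-02
grep -in "slept" regionserver.2013-10-02    # long GC pauses show up as oversized sleeps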

J-D


On Wed, Oct 2, 2013 at 10:15 AM, Ionut Ignatescu
<ionut.ignate...@gmail.com> wrote:

> Hi,
>
> I have a Hadoop & HBase cluster running Hadoop 1.1.2 and HBase 0.94.7.
> I have noticed an issue that stops the cluster from running normally.
> My use case: I have several MR jobs that read data from one HBase table in
> the map phase and write data to 3 different tables during the reduce phase. I
> create the table handles on my own; I don't use TableOutputFormat. The only
> way out I have found is to restart the region server daemon on the affected
> region server.
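>
> A rough sketch of what the reduce-side write path looks like (not the actual
> code; class, table, family and qualifier names here are just placeholders):
>
> import java.io.IOException;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.HTable;
> import org.apache.hadoop.hbase.client.Put;
> import org.apache.hadoop.hbase.util.Bytes;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Reducer;
>
> public class MultiTableReducer extends Reducer<Text, Text, Text, Text> {
>   // table handles created by hand, no TableOutputFormat
>   private HTable tableA, tableB, tableC;
>
>   @Override
>   protected void setup(Context ctx) throws IOException {
>     Configuration conf = HBaseConfiguration.create(ctx.getConfiguration());
>     tableA = new HTable(conf, "table_a");  // placeholder table names
>     tableB = new HTable(conf, "table_b");
>     tableC = new HTable(conf, "table_c");
>   }
>
>   @Override
>   protected void reduce(Text key, Iterable<Text> values, Context ctx)
>       throws IOException, InterruptedException {
>     for (Text v : values) {
>       Put put = new Put(Bytes.toBytes(key.toString()));
>       put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(v.toString()));
>       // the same record is fanned out to the three tables
>       tableA.put(put);
>       tableB.put(put);
>       tableC.put(put);
>     }
>   }
>
>   @Override
>   protected void cleanup(Context ctx) throws IOException {
>     tableA.close();
>     tableB.close();
>     tableC.close();
>   }
> }
>
> The handles stay open for the lifetime of the reduce task and are only closed
> in cleanup().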
>
> On namenode:
> cat namenode.2013-10-02 | grep "blk_3136705509461132997_43329"
> Wed Oct 02 13:32:17 2013 GMT namenode 3852-0@namenode:0 [INFO] (IPC Server
> handler 29 on 22700) org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.allocateBlock:
>
> /hbase/.logs/datanode1,60020,1380637389766/datanode1%2C60020%2C1380637389766.1380720737247.
> blk_3136705509461132997_43329
> Wed Oct 02 13:33:38 2013 GMT namenode 3852-0@namenode:0 [INFO] (IPC Server
> handler 13 on 22700) org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> commitBlockSynchronization(lastblock=blk_3136705509461132997_43329,
> newgenerationstamp=43366, newlength=40045568, newtargets=[
> 10.81.18.101:50010],
> closeFile=false, deleteBlock=false)
>
> On region server:
> cat regionserver.2013-10-02 | grep "1380720737247"
> Wed Oct 02 13:32:17 2013 GMT regionserver 5854-0@datanode1:0 [INFO]
> (regionserver60020.logRoller)
> org.apache.hadoop.hbase.regionserver.wal.HLog: Roll
>
> /hbase/.logs/datanode1,60020,1380637389766/datanode1%2C60020%2C1380637389766.1380720701436,
> entries=149, filesize=63934833.  for
>
> /hbase/.logs/datanode1,60020,1380637389766/datanode1%2C60020%2C1380637389766.1380720737247
> Wed Oct 02 13:33:37 2013 GMT regionserver 5854-0@datanode1:0 [WARN]
> (DataStreamer for file
>
> /hbase/.logs/datanode1,60020,1380637389766/datanode1%2C60020%2C1380637389766.1380720737247
> block blk_3136705509461132997_43329) org.apache.hadoop.hdfs.DFSClient:
> Error Recovery for block blk_3136705509461132997_43329 bad datanode[0]
> 10.80.40.176:50010
> Wed Oct 02 13:33:37 2013 GMT regionserver 5854-0@datanode1:0 [WARN]
> (DataStreamer for file
>
> /hbase/.logs/datanode1,60020,1380637389766/datanode1%2C60020%2C1380637389766.1380720737247
> block blk_3136705509461132997_43329) org.apache.hadoop.hdfs.DFSClient:
> Error Recovery for block blk_3136705509461132997_43329 in pipeline
> 10.80.40.176:50010, 10.81.111.8:50010, 10.81.18.101:50010: bad datanode
> 10.80.40.176:50010
> Wed Oct 02 13:33:43 2013 GMT regionserver 5854-0@datanode1:0 [INFO]
> (regionserver60020.logRoller) org.apache.hadoop.hdfs.DFSClient: Could not
> complete file
>
> /hbase/.logs/datanode1,60020,1380637389766/datanode1%2C60020%2C1380637389766.1380720737247
> retrying...
> Wed Oct 02 13:33:43 2013 GMT regionserver 5854-0@datanode1:0 [INFO]
> (regionserver60020.logRoller) org.apache.hadoop.hdfs.DFSClient: Could not
> complete file
>
> /hbase/.logs/datanode1,60020,1380637389766/datanode1%2C60020%2C1380637389766.1380720737247
> retrying...
> Wed Oct 02 13:33:44 2013 GMT regionserver 5854-0@datanode1:0 [INFO]
> (regionserver60020.logRoller) org.apache.hadoop.hdfs.DFSClient: Could not
> complete file
>
> /hbase/.logs/datanode1,60020,1380637389766/datanode1%2C60020%2C1380637389766.1380720737247
> retrying...
>
> cat regionserver.2013-10-02 | grep "1380720737247" | grep 'Could not complete' | wc -l
> 5640
>
>
> In datanode logs, that runs on the same host with region server:
> cat datanode.2013-10-02 | grep "blk_3136705509461132997_43329"
> Wed Oct 02 13:32:17 2013 GMT datanode 5651-0@datanode1:0 [INFO]
> (org.apache.hadoop.hdfs.server.datanode.DataXceiver@ca6b1e3)
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> blk_3136705509461132997_43329 src: /10.80.40.176:36721 dest: /
> 10.80.40.176:50010
> Wed Oct 02 13:33:37 2013 GMT datanode 5651-0@datanode1:0 [INFO]
> (org.apache.hadoop.hdfs.server.datanode.DataXceiver@ca6b1e3)
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.80.40.176:50010,
> storageID=DS-812180968-10.80.40.176-50010-1380263000454, infoPort=50075,
> ipcPort=50020): Exception writing block blk_3136705509461132997_43329 to
> mirror 10.81.111.8:50010
> Wed Oct 02 13:33:37 2013 GMT datanode 5651-0@datanode1:0 [INFO]
> (org.apache.hadoop.hdfs.server.datanode.DataXceiver@ca6b1e3)
> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock
> for block blk_3136705509461132997_43329 java.io.IOException: Connection
> reset by peer
> Wed Oct 02 13:33:38 2013 GMT datanode 5651-0@datanode1:0 [INFO]
> (PacketResponder 2 for Block blk_3136705509461132997_43329)
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
> blk_3136705509461132997_43329 2 : Thread is interrupted.
> Wed Oct 02 13:33:38 2013 GMT datanode 5651-0@datanode1:0 [INFO]
> (PacketResponder 2 for Block blk_3136705509461132997_43329)
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for
> block blk_3136705509461132997_43329 terminating
> Wed Oct 02 13:33:38 2013 GMT datanode 5651-0@datanode1:0 [INFO]
> (org.apache.hadoop.hdfs.server.datanode.DataXceiver@ca6b1e3)
> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> blk_3136705509461132997_43329 received exception java.io.IOException:
> Connection reset by peer
>
> On datanode2, called by datanode1 to replicate logs:
> cat datanode.2013-10-02 | grep "blk_3136705509461132997_43329"
> Wed Oct 02 13:32:17 2013 GMT datanode 5986-0@datanode2:0 [INFO]
> (org.apache.hadoop.hdfs.server.datanode.DataXceiver@6afc190b)
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> blk_3136705509461132997_43329 src: /10.80.40.176:57279 dest: /
> 10.81.111.8:50010
> Wed Oct 02 13:33:37 2013 GMT datanode 5986-0@datanode2:0 [INFO] (IPC
> Server
> handler 18 on 50020) org.apache.hadoop.hdfs.server.datanode.DataNode:
> Client calls recoverBlock(block=blk_3136705509461132997_43329, targets=[
> 10.81.111.8:50010, 10.81.18.101:50010])
> Wed Oct 02 13:33:37 2013 GMT datanode 5986-0@datanode2:0 [WARN]
> (PacketResponder 1 for Block blk_3136705509461132997_43329)
> org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in
> BlockReceiver.run():  { java.nio.channels.ClosedChannelException | at
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:135) |
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:326) | at
>
> org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
> | at
>
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
> | at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
> | at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
> | at java.io.DataOutputStream.writeLong(DataOutputStream.java:207) | at
>
> org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.write(DataTransferProtocol.java:135)
> | at
>
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:939)
> | at java.lang.Thread.run(Thread.java:662) }
> Wed Oct 02 13:33:37 2013 GMT datanode 5986-0@datanode2:0 [INFO]
> (PacketResponder 1 for Block blk_3136705509461132997_43329)
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
> blk_3136705509461132997_43329 1 Exception
> java.nio.channels.ClosedChannelException
> Wed Oct 02 13:33:37 2013 GMT datanode 5986-0@datanode2:0 [INFO]
> (org.apache.hadoop.hdfs.server.datanode.DataXceiver@6afc190b)
> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock
> for block blk_3136705509461132997_43329
> java.nio.channels.ClosedByInterruptException
> Wed Oct 02 13:33:37 2013 GMT datanode 5986-0@datanode2:0 [INFO]
> (org.apache.hadoop.hdfs.server.datanode.DataXceiver@6afc190b)
> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> blk_3136705509461132997_43329 received exception java.io.IOException:
> Interrupted receiveBlock
> Wed Oct 02 13:33:38 2013 GMT datanode 5986-0@datanode2:0 [INFO]
> (PacketResponder 1 for Block blk_3136705509461132997_43329)
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for
> block blk_3136705509461132997_43329 terminating
> Wed Oct 02 13:33:38 2013 GMT datanode 5986-0@datanode2:0 [WARN] (IPC
> Server
> handler 18 on 50020)
> org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to
> getBlockMetaDataInfo for block (=blk_3136705509461132997_43329) from
> datanode (=10.81.111.8:50010) { java.io.IOException: Block
> blk_3136705509461132997_43329 length is 39980544 does not match block file
> length 40045568 | at
>
> org.apache.hadoop.hdfs.server.datanode.FSDataset.validateBlockMetadata(FSDataset.java:1861)
> | at
>
> org.apache.hadoop.hdfs.server.datanode.FSDataset.startBlockRecovery(FSDataset.java:2166)
> | at
>
> org.apache.hadoop.hdfs.server.datanode.DataNode.startBlockRecovery(DataNode.java:1783)
> | at
>
> org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1972)
> | at
>
> org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:2114)
> | at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) | at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> | at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> | at java.lang.reflect.Method.invoke(Method.java:597) | at
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578) | at
> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393) | at
> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389) | at
> java.security.AccessController.doPrivileged(Native Method) | at
> javax.security.auth.Subject.doAs(Subject.java:396) | at
>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
> | at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387) }
>
> For me, this sounds like an HDFS issue, but I cannot find the root cause.
> Any idea is welcome.
> I checked OS file descriptors and the HDFS xcievers setting; everything
> looks ok.
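>
> For reference, this is how I verified those two (the 4096 below is just the
> commonly recommended value for HBase, not necessarily the right one for every
> setup):
>
> ulimit -n        # as the user running the DataNode and the RegionServer
>
> and in hdfs-site.xml on the datanodes:
>
> <property>
>   <name>dfs.datanode.max.xcievers</name>
>   <value>4096</value>
> </property>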
>
> Thanks in advance,
> Ionut I.
>
