[
https://issues.apache.org/jira/browse/HDFS-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Todd Lipcon updated HDFS-4203:
------------------------------
Priority: Minor (was: Major)
> After recoverFileLease, datanode gets stuck complaining block '...has out of
> date GS ....may already be committed'
> ------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-4203
> URL: https://issues.apache.org/jira/browse/HDFS-4203
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: name-node
> Affects Versions: 1.1.0
> Reporter: stack
> Priority: Minor
>
> After calling recoverFileLease, an append to a file gets stuck retrying this:
> {code}
> 2012-11-16 13:06:14,298 DEBUG [IPC Server handler 2 on 53224]
> namenode.PendingReplicationBlocks(92): Removing pending replication for
> blockblk_-3222397051272483489_1006
> 2012-11-16 13:06:43,881 WARN [DataStreamer for file /hbase/hlog/hlog.dat.2
> block blk_-3222397051272483489_1003] hdfs.DFSClient$DFSOutputStream(3216):
> Error Recovery for block blk_-3222397051272483489_1003 bad datanode[0]
> 127.0.0.1:53228
> 2012-11-16 13:06:43,881 WARN [DataStreamer for file /hbase/hlog/hlog.dat.2
> block blk_-3222397051272483489_1003] hdfs.DFSClient$DFSOutputStream(3267):
> Error Recovery for block blk_-3222397051272483489_1003 in pipeline
> 127.0.0.1:53228, 127.0.0.1:53231: bad datanode 127.0.0.1:53228
> 2012-11-16 13:06:43,884 INFO [IPC Server handler 1 on 53233]
> datanode.DataNode(2123): Client calls
> recoverBlock(block=blk_-3222397051272483489_1003, targets=[127.0.0.1:53231])
> 2012-11-16 13:06:43,884 DEBUG [IPC Server handler 1 on 53233]
> datanode.FSDataset(2143): Interrupting active writer threads for block
> blk_-3222397051272483489_1006
> 2012-11-16 13:06:43,884 DEBUG [IPC Server handler 1 on 53233]
> datanode.FSDataset(2159): getBlockMetaDataInfo successful
> block=blk_-3222397051272483489_1006 length 120559 genstamp 1006
> 2012-11-16 13:06:43,884 DEBUG [IPC Server handler 1 on 53233]
> datanode.DataNode(2039): block=blk_-3222397051272483489_1003,
> (length=120559),
> syncList=[BlockRecord(info=BlockRecoveryInfo(block=blk_-3222397051272483489_1006
> wasRecoveredOnStartup=false) node=127.0.0.1:53231)], closeFile=false
> 2012-11-16 13:06:43,885 INFO [IPC Server handler 2 on 53224]
> namenode.FSNamesystem(5468): blk_-3222397051272483489_1003 has out of date GS
> 1003 found 1006, may already be committed
> 2012-11-16 13:06:43,885 ERROR [IPC Server handler 2 on 53224]
> security.UserGroupInformation(1139): PriviledgedActionException as:stack
> cause:java.io.IOException: blk_-3222397051272483489_1003 has out of date GS
> 1003 found 1006, may already be committed
> 2012-11-16 13:06:43,885 ERROR [IPC Server handler 1 on 53233]
> security.UserGroupInformation(1139): PriviledgedActionException
> as:blk_-3222397051272483489_1003 cause:org.apache.hadoop.ipc.RemoteException:
> java.io.IOException: blk_-3222397051272483489_1003 has out of date GS 1003
> found 1006, may already be committed
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.nextGenerationStampForBlock(FSNamesystem.java:5469)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode.nextGenerationStamp(NameNode.java:781)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
> 2012-11-16 13:06:43,886 WARN [DataStreamer for file /hbase/hlog/hlog.dat.2
> block blk_-3222397051272483489_1003] hdfs.DFSClient$DFSOutputStream(3292):
> Failed recovery attempt #1 from primary datanode 127.0.0.1:53231
> org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.ipc.RemoteException:
> java.io.IOException: blk_-3222397051272483489_1003 has out of date GS 1003
> found 1006, may already be committed
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.nextGenerationStampForBlock(FSNamesystem.java:5469)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode.nextGenerationStamp(NameNode.java:781)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
> at org.apache.hadoop.ipc.Client.call(Client.java:1092)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
> at $Proxy12.nextGenerationStamp(Unknown Source)
> at
> org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:2059)
> at
> org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:2027)
> at
> org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:2107)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
> at org.apache.hadoop.ipc.Client.call(Client.java:1092)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
> at $Proxy17.recoverBlock(Unknown Source)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3290)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2300(DFSClient.java:2754)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2958)
> {code}
> Eventually the DFSClient fails with:
> {code}
> 2012-11-16 13:06:53,945 WARN [ZombieLastLogWriterRegionServer]
> hdfs.DFSClient$DFSOutputStream(3918): Error while syncing
> java.io.IOException: Error Recovery for block blk_-3222397051272483489_1003
> failed because recovery from primary datanode 127.0.0.1:53228 failed 6
> times. Pipeline was 127.0.0.1:53228. Aborting...
> at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3326)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2300(DFSClient.java:2754)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2958)
> {code}
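> For clarity, here is the sequence above as a repro sketch (hypothetical
> test code, not taken from this report; it assumes branch-1's
> DistributedFileSystem#recoverLease and a cluster reachable via the default
> Configuration):
> {code}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hdfs.DistributedFileSystem;
>
> public class StaleGenerationStampRepro {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     Path path = new Path("/hbase/hlog/hlog.dat.2");
>
>     // Writer holds the lease; its pipeline knows the block at, say, GS 1003.
>     FileSystem fs = FileSystem.get(conf);
>     FSDataOutputStream out = fs.create(path);
>     out.write(new byte[64 * 1024]);
>     out.sync();
>
>     // A second client (what HBase's recoverFileLease amounts to) recovers
>     // the lease; NN-side block recovery bumps the GS, e.g. 1003 -> 1006.
>     Configuration conf2 = new Configuration(conf);
>     conf2.setBoolean("fs.hdfs.impl.disable.cache", true); // distinct client
>     DistributedFileSystem dfs2 = (DistributedFileSystem) FileSystem.get(conf2);
>     dfs2.recoverLease(path);
>
>     // The zombie writer keeps going. Its next pipeline recovery presents
>     // the stale GS 1003; the NN rejects nextGenerationStamp, and the
>     // DFSClient retries 6 times before aborting, as in the logs above.
>     out.write(new byte[1]);
>     out.sync();
>   }
> }
> {code}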
> It is good that the client eventually comes back, so this issue is not the
> end of the world. But while it is doing the above retrying, the client
> seems uninterruptible. It would just be nice if the DN gave up quicker
> once the lease is gone.
> Looking at the code, it seems that in DataNode#syncBlock we make an
> explicit call against the NN to nextGenerationStamp (per the stack trace
> above), and it is this call that is failing. Should we be checking that we
> still hold the file lease first?
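> To make that concrete, here is a hedged sketch of the suggestion
> (hypothetical: hasLease is a made-up RPC that does not exist in
> DatanodeProtocol today, and the surrounding syncBlock code is paraphrased
> from branch-1, not copied):
> {code}
> // Inside DataNode#syncBlock (branch-1, paraphrased). Today we go straight
> // to the NN for a new generation stamp; when the lease was already
> // recovered, the NN refuses every attempt with "has out of date GS ...
> // may already be committed" and the client keeps retrying.
>
> // Hypothetical pre-check: ask the NN whether this recovery is still
> // wanted before minting a new GS. No such RPC exists; this is only the
> // shape of the check, not a working patch.
> if (!namenode.hasLease(block)) {                    // hypothetical RPC
>   throw new IOException(block + ": lease already recovered and block"
>       + " likely committed; aborting recovery instead of retrying");
> }
> long newGS = namenode.nextGenerationStamp(block, closeFile);
> {code}
> Alternatively, the DFSClient could treat the "may already be committed"
> RemoteException as terminal in processDatanodeError instead of burning all
> six recovery attempts on it.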
> I see this issue in 1.1 and 1.0. It looks like it is also in trunk, though
> I've not tried it there.
> If someone can suggest what we should do here, I can try making a patch.
> Thanks.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira