[ https://issues.apache.org/jira/browse/HDFS-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-4203:
------------------------------

    Priority: Minor  (was: Major)
    
> After recoverFileLease, datanode gets stuck complaining block '...has out of 
> date GS ....may already be committed'
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-4203
>                 URL: https://issues.apache.org/jira/browse/HDFS-4203
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 1.1.0
>            Reporter: stack
>            Priority: Minor
>
> After calling recoverFileLease, an append to a file gets stuck retrying this:
> {code}
> 2012-11-16 13:06:14,298 DEBUG [IPC Server handler 2 on 53224] 
> namenode.PendingReplicationBlocks(92): Removing pending replication for 
> blockblk_-3222397051272483489_1006
> 2012-11-16 13:06:43,881 WARN  [DataStreamer for file /hbase/hlog/hlog.dat.2 
> block blk_-3222397051272483489_1003] hdfs.DFSClient$DFSOutputStream(3216): 
> Error Recovery for block blk_-3222397051272483489_1003 bad datanode[0] 
> 127.0.0.1:53228
> 2012-11-16 13:06:43,881 WARN  [DataStreamer for file /hbase/hlog/hlog.dat.2 
> block blk_-3222397051272483489_1003] hdfs.DFSClient$DFSOutputStream(3267): 
> Error Recovery for block blk_-3222397051272483489_1003 in pipeline 
> 127.0.0.1:53228, 127.0.0.1:53231: bad datanode 127.0.0.1:53228
> 2012-11-16 13:06:43,884 INFO  [IPC Server handler 1 on 53233] 
> datanode.DataNode(2123): Client calls 
> recoverBlock(block=blk_-3222397051272483489_1003, targets=[127.0.0.1:53231])
> 2012-11-16 13:06:43,884 DEBUG [IPC Server handler 1 on 53233] 
> datanode.FSDataset(2143): Interrupting active writer threads for block 
> blk_-3222397051272483489_1006
> 2012-11-16 13:06:43,884 DEBUG [IPC Server handler 1 on 53233] 
> datanode.FSDataset(2159): getBlockMetaDataInfo successful 
> block=blk_-3222397051272483489_1006 length 120559 genstamp 1006
> 2012-11-16 13:06:43,884 DEBUG [IPC Server handler 1 on 53233] 
> datanode.DataNode(2039): block=blk_-3222397051272483489_1003, 
> (length=120559), 
> syncList=[BlockRecord(info=BlockRecoveryInfo(block=blk_-3222397051272483489_1006
>  wasRecoveredOnStartup=false) node=127.0.0.1:53231)], closeFile=false
> 2012-11-16 13:06:43,885 INFO  [IPC Server handler 2 on 53224] 
> namenode.FSNamesystem(5468): blk_-3222397051272483489_1003 has out of date GS 
> 1003 found 1006, may already be committed
> 2012-11-16 13:06:43,885 ERROR [IPC Server handler 2 on 53224] 
> security.UserGroupInformation(1139): PriviledgedActionException as:stack 
> cause:java.io.IOException: blk_-3222397051272483489_1003 has out of date GS 
> 1003 found 1006, may already be committed
> 2012-11-16 13:06:43,885 ERROR [IPC Server handler 1 on 53233] 
> security.UserGroupInformation(1139): PriviledgedActionException 
> as:blk_-3222397051272483489_1003 cause:org.apache.hadoop.ipc.RemoteException: 
> java.io.IOException: blk_-3222397051272483489_1003 has out of date GS 1003 
> found 1006, may already be committed
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.nextGenerationStampForBlock(FSNamesystem.java:5469)
>       at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.nextGenerationStamp(NameNode.java:781)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:396)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
> 2012-11-16 13:06:43,886 WARN  [DataStreamer for file /hbase/hlog/hlog.dat.2 
> block blk_-3222397051272483489_1003] hdfs.DFSClient$DFSOutputStream(3292): 
> Failed recovery attempt #1 from primary datanode 127.0.0.1:53231
> org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.ipc.RemoteException: 
> java.io.IOException: blk_-3222397051272483489_1003 has out of date GS 1003 
> found 1006, may already be committed
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.nextGenerationStampForBlock(FSNamesystem.java:5469)
>       at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.nextGenerationStamp(NameNode.java:781)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:396)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1092)
>       at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
>       at $Proxy12.nextGenerationStamp(Unknown Source)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:2059)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:2027)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:2107)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:396)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1092)
>       at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
>       at $Proxy17.recoverBlock(Unknown Source)
>       at 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3290)
>       at 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2300(DFSClient.java:2754)
>       at 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2958)
> {code}
> Eventually the DFSClient fails with:
> {code}
> 2012-11-16 13:06:53,945 WARN  [ZombieLastLogWriterRegionServer] 
> hdfs.DFSClient$DFSOutputStream(3918): Error while syncing
> java.io.IOException: Error Recovery for block blk_-3222397051272483489_1003 
> failed  because recovery from primary datanode 127.0.0.1:53228 failed 6 
> times.  Pipeline was 127.0.0.1:53228. Aborting...
>       at 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3326)
>       at 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2300(DFSClient.java:2754)
>       at 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2958)
> {code}
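> For illustration, here is a minimal sketch (not from the report) of the 
> two-client sequence behind the logs above.  The path and sizes are copied 
> from the log for readability; it assumes the branch-1 
> DistributedFileSystem#recoverLease call that recoverFileLease wraps, and the 
> two methods stand for two separate processes.
> {code}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hdfs.DistributedFileSystem;
>
> public class StolenLeaseSketch {
>   // Path copied from the log above; any file open for write will do.
>   static final Path LOG = new Path("/hbase/hlog/hlog.dat.2");
>
>   /** Process A: the original writer (in the report, a zombie log writer). */
>   static void writerProcess(Configuration conf) throws Exception {
>     FileSystem fs = LOG.getFileSystem(conf);
>     FSDataOutputStream out = fs.create(LOG);
>     out.write(new byte[120559]);
>     out.sync();            // lease held, block blk_..._1003 is in the pipeline
>     Thread.sleep(60000);   // ...meanwhile process B recovers the lease...
>     out.write(1);
>     out.sync();            // pipeline recovery now retries with the stale GS 1003
>   }
>
>   /** Process B: the recoverer (in the report, HBase log splitting). */
>   static void recoveryProcess(Configuration conf) throws Exception {
>     DistributedFileSystem dfs = (DistributedFileSystem) LOG.getFileSystem(conf);
>     dfs.recoverLease(LOG); // takes over the lease and bumps the GS (1003 -> 1006)
>   }
> }
> {code}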
> It is good that the client eventually gives up, so this is not the end of 
> the world.  But while it is doing the above retrying the client seems 
> uninterruptible, and it would be nice if the DN gave up sooner once the 
> lease is gone.
> Looking at the code, DataNode#syncBlock makes an explicit call to the NN's 
> nextGenerationStamp, and it is this call that is failing.  Should we be 
> checking that we still hold the file lease first?
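> To make that concrete, here is an illustrative sketch (not actual Hadoop 
> code; the class and method names are hypothetical) of the kind of fail-fast 
> check I mean: if the newest generation stamp visible on the datanode already 
> exceeds the one the client is trying to recover, retrying is pointless and 
> recovery could abort immediately instead of looping.
> {code}
> import java.io.IOException;
> import org.apache.hadoop.hdfs.protocol.Block;
>
> public class GenerationStampGuard {  // hypothetical helper, for illustration
>   /**
>    * True when recovering {@code clientBlock} is pointless because a newer
>    * generation stamp is already on disk, i.e. lease recovery has moved on.
>    */
>   static boolean recoveryIsStale(Block clientBlock, Block onDiskBlock) {
>     return onDiskBlock.getGenerationStamp() > clientBlock.getGenerationStamp();
>   }
>
>   public static void main(String[] args) throws IOException {
>     // Numbers taken from the log above: client still has GS 1003, DN has 1006.
>     Block clientView   = new Block(-3222397051272483489L, 120559L, 1003L);
>     Block datanodeView = new Block(-3222397051272483489L, 120559L, 1006L);
>     if (recoveryIsStale(clientView, datanodeView)) {
>       // Instead of six futile recoverBlock attempts, fail right away.
>       throw new IOException(clientView + " superseded by GS "
>           + datanodeView.getGenerationStamp() + "; aborting recovery");
>     }
>   }
> }
> {code}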
> I see this issue in 1.1 and 1.0, and it looks like it is also in trunk, 
> though I have not tried it there.
> If someone can suggest what we should do here, I can try making a patch.  
> Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
