[ https://issues.apache.org/jira/browse/HDFS-13236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391622#comment-16391622 ]

Kihwal Lee commented on HDFS-13236:
-----------------------------------

For the block placement differences: {{DFSNetworkTopology}} is new in Hadoop 3 
and might be involved; see HDFS-11419. [~vagarychen] might be able to tell 
whether it is related.

The missing client ID is very strange, since the op does have a call ID. Both 
come from the handler's thread-local variable set by the RPC server, and the 
client ID field isn't the last field in the edit either. The client ID is 
created when an RPC client is created and is set in the connection context 
header; it is all internal and automatic. I don't know what happens when the 
connection is dropped before edit logging, while the call is still being 
processed. [~daryn], does the server reset the client ID in this case?
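For reference, the check that fires in the stack trace below is a 16-byte 
length precondition on the client ID. A simplified, standalone sketch of that 
validation (class and method names are illustrative, not the actual Hadoop 
source; only the error message mirrors the log):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class ClientIdCheck {
    // RPC client IDs are 16-byte UUIDs carried in the connection context header.
    static final int BYTE_LENGTH = 16;

    // Mirrors the precondition that RetryCache$CacheEntry's constructor applies.
    static void checkClientId(byte[] clientId) {
        if (clientId.length != BYTE_LENGTH) {
            throw new IllegalArgumentException("Invalid clientId - length is "
                    + clientId.length + " expected length " + BYTE_LENGTH);
        }
    }

    // Builds a well-formed 16-byte client ID from a random UUID.
    static byte[] newClientId() {
        UUID uuid = UUID.randomUUID();
        ByteBuffer buf = ByteBuffer.allocate(BYTE_LENGTH);
        buf.putLong(uuid.getMostSignificantBits());
        buf.putLong(uuid.getLeastSignificantBits());
        return buf.array();
    }

    public static void main(String[] args) {
        checkClientId(newClientId()); // a normal client ID passes
        try {
            // An empty ID, as in the "RpcClientId=" field of the logged AddOp, fails.
            checkClientId(new byte[0]);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Replaying such an edit on the standby trips this check before the op is 
applied, which is why the tailer aborts.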


> Standby NN down with error encountered while tailing edits
> ----------------------------------------------------------
>
>                 Key: HDFS-13236
>                 URL: https://issues.apache.org/jira/browse/HDFS-13236
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: journal-node, namenode
>    Affects Versions: 3.0.0
>            Reporter: Yuriy Malygin
>            Priority: Major
>
> After updating Hadoop from 2.7.3 to 3.0.0, the standby NN went down with an 
> error encountered while tailing edits from the JN:
> {code:java}
> Feb 28 01:58:31 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:31,594 
> INFO [FSImageSaver for /one/hadoop-data/dfs of type IMAGE_AND_EDITS] 
> FSImageFormatProtobuf - Image file 
> /one/hadoop-data/dfs/current/fsimage.ckpt_0000000001274897998 of size 
> 4595971949 bytes saved in 93 seconds.
> Feb 28 01:58:33 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:33,445 
> INFO [Standby State Checkpointer] NNStorageRetentionManager - Going to retain 
> 2 images with txid >= 1274897935
> Feb 28 01:58:33 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:33,445 
> INFO [Standby State Checkpointer] NNStorageRetentionManager - Purging old 
> image 
> FSImageFile(file=/one/hadoop-data/dfs/current/fsimage_0000000001274897875, 
> cpktTxId=0000000001274897875)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:34,660 
> INFO [Edit log tailer] FSImage - Reading 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@6a168e6f 
> expecting start txid #1274897999
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:34,660 
> INFO [Edit log tailer] FSImage - Start loading edits file 
> http://srvd87.local:8480/getJournal?jid=datalab-hadoop-backup&segmentTxId=1274897999&storageInfo=-64%3A1056233980%3A0%3ACID-1fba08aa-c8bd-4217-aef5-6ed206893848&inProgressOk=true, 
> http://srve2916.local:8480/getJournal?jid=datalab-hadoop-backup&segmentTxId=1274897999&storageInfo=-64%3A1056233980%3A0%3ACID-1fba08aa-c8bd-4217-aef5-6ed206893848&inProgressOk=true
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:34,661 
> INFO [Edit log tailer] RedundantEditLogInputStream - Fast-forwarding stream 
> 'http://srvd87.local:8480/getJournal?jid=datalab-hadoop-backup&segmentTxId=1274897999&storageInfo=-64%3A1056233980%3A0%3ACID-1fba08aa-c8bd-4217-aef5-6ed206893848&inProgressOk=true, 
> http://srve2916.local:8480/getJournal?jid=datalab-hadoop-backup&segmentTxId=1274897999&storageInfo=-64%3A1056233980%3A0%3ACID-1fba08aa-c8bd-4217-aef5-6ed206893848&inProgressOk=true' to transaction ID 1274897999
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:34,661 
> INFO [Edit log tailer] RedundantEditLogInputStream - Fast-forwarding stream 
> 'http://srvd87.local:8480/getJournal?jid=datalab-hadoop-backup&segmentTxId=1274897999&storageInfo=-64%3A1056233980%3A0%3ACID-1fba08aa-c8bd-4217-aef5-6ed206893848&inProgressOk=true'
>  to transaction ID 1274897999
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:34,680 
> ERROR [Edit log tailer] FSEditLogLoader - Encountered exception on operation 
> AddOp [length=0, inodeId=145550319, 
> path=/kafka/parquet/infrastructureGrace/date=2018-02-28/_temporary/1/_temporary/attempt_1516181147167_20856_r_000098_0/part-r-00098.gz.parquet,
>  replication=3, mtime=1519772206615, atime=1519772206615, 
> blockSize=134217728, blocks=[], permissions=root:supergroup:rw-r--r--, 
> aclEntries=null, 
> clientName=DFSClient_attempt_1516181147167_20856_r_000098_0_1523538799_1, 
> clientMachine=10.137.2.142, overwrite=false, RpcClientId=, 
> RpcCallId=271996603, storagePolicyId=0, erasureCodingPolicyId=0, 
> opCode=OP_ADD, txid=1274898002]
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: 
> java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected 
> length 16
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.ipc.RetryCache$CacheEntry.<init>(RetryCache.java:74)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.ipc.RetryCache$CacheEntry.<init>(RetryCache.java:86)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.<init>(RetryCache.java:163)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:946)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:882)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:863)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:293)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:427)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$400(EditLogTailer.java:380)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:397)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:481)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:393)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:34,688 
> FATAL [Edit log tailer] EditLogTailer - Unknown error encountered while 
> tailing edits. Shutting down standby NN.
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: java.io.IOException: 
> java.lang.IllegalStateException: Cannot skip to less than the current value 
> (=145550319), where newValue=145550318
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1941)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:298)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:882)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:863)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:293)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:427)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$400(EditLogTailer.java:380)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:397)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:481)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:393)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: Caused by: 
> java.lang.IllegalStateException: Cannot skip to less than the current value 
> (=145550319), where newValue=145550318
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.util.SequentialNumber.skipTo(SequentialNumber.java:58)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1939)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: ... 10 more
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:34,708  
> INFO [Edit log tailer] ExitUtil - Exiting with status 1: java.io.IOException: 
> java.lang.IllegalStateException: Cannot skip to less than the current value 
> (=145550319), where newValue=145550318
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:34,722 
> INFO [pool-1-thread-1] NameNode - SHUTDOWN_MSG:
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: 
> /************************************************************
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: SHUTDOWN_MSG: Shutting down 
> NameNode at srvd2135.local/10.137.2.39
> {code}
>  
> Additional information about errors after the update to 3.0.0:
> The active NN logs contain many messages about replication errors (caused by 
> Rack Awareness):
> {code:java}
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: 2018-02-28 01:57:20,804 WARN 
> [IPC Server handler 10 on 8020] PmsRackMapping - Got empty rack for 
> 10.136.2.149, reverting to default.
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: 2018-02-28 01:57:20,806 DEBUG 
> [IPC Server handler 10 on 8020] BlockPlacementPolicy - Failed to choose from 
> local rack (location = /default-rack); the second replica is not found, retry 
> choosing randomly
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:792)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:691)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:598)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalStorage(BlockPlacementPolicyDefault.java:558)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:461)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:392)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:268)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:121)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:137)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2093)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:287)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2602)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:864)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:549)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> java.security.AccessController.doPrivileged(Native Method)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> javax.security.auth.Subject.doAs(Subject.java:422)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> {code}
> {code:java}
> $ zgrep -c NotEnoughReplicasException 
> archive/datalab-namenode.log-20180{22*,3*}.gz
> archive/datalab-namenode.log-20180228.gz:0       #hadoop 2.7.3 
> archive/datalab-namenode.log-20180301.gz:173492  #hadoop 3.0.0
> archive/datalab-namenode.log-20180302.gz:153192  #hadoop 3.0.0
> archive/datalab-namenode.log-20180303.gz:0       #hadoop 2.7.3{code}
> In the JN logs: 
> {code:java}
> Feb 28 01:57:51 srvd87 datalab-journalnode[29960]: 2018-02-28 01:57:51,552 
> INFO [IPC Server handler 4 on 8485] FileJournalManager - Finalizing edits 
> file 
> /one/hadoop-data/journal/datalab-hadoop-backup/current/edits_inprogress_0000000001274897999 -> 
> /one/hadoop-data/journal/datalab-hadoop-backup/current/edits_0000000001274897999-0000000001274898515
> Feb 28 01:58:34 srvd87 datalab-journalnode[29960]: 2018-02-28 01:58:34,671 
> INFO [qtp414690789-164] TransferFsImage - Sending fileName: 
> /one/hadoop-data/journal/datalab-hadoop-backup/current/edits_0000000001274897999-0000000001274898515,
>  fileSize: 80772. Sent total: 80772 bytes. Size of last segment intended to 
> send: -1 bytes.
> {code}
> {code:java}
> $ zgrep -c 'Size of last segment intended to send' 
> archive/datalab-journalnode.log-20180{22*,3*}.gz
> archive/datalab-journalnode.log-20180227.gz:0    #hadoop 2.7.3
> archive/datalab-journalnode.log-20180301.gz:109  #hadoop 3.0.0
> archive/datalab-journalnode.log-20180302.gz:111  #hadoop 3.0.0
> archive/datalab-journalnode.log-20180304.gz:0    #hadoop 2.7.3
> {code}


