[ https://issues.apache.org/jira/browse/HDFS-14081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687526#comment-16687526 ]
Xiao Chen commented on HDFS-14081:
----------------------------------

Thanks [~shwetayakkali] for the patch and [~kihwal] for the suggestion.

Please don't set Fix Versions - it's supposed to be used by committers to reflect which branch a patch is actually checked in to.

[~kihwal]'s suggestion makes sense - some queues only make sense for the ANN, so we may as well bump FSN#metaSave to be active-only. Though I'm not sure that is exactly what we see here (and the NPE itself feels like it warrants some improvement) ...

When discussing with [~shwetayakkali], one part of the code I found interesting is that {{BlockManager#rescanPostponedMisreplicatedBlocks}} checks {{BlockInfo}} for null, whereas {{BlockManager#dumpBlockMeta}} does not. Both are protected by FSN write locks, but the missing null check looks like a bug. Please let me know if I missed anything.

> hdfs dfsadmin -metasave metasave_test results NPE
> -------------------------------------------------
>
>                 Key: HDFS-14081
>                 URL: https://issues.apache.org/jira/browse/HDFS-14081
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 3.2.1
>            Reporter: Shweta
>            Assignee: Shweta
>            Priority: Major
>         Attachments: HDFS-14081.001.patch
>
>
> Race condition is encountered while adding Block to postponedMisreplicatedBlocks which in turn tried to retrieve Block from BlockManager in which it may not be present.
> This happens in HA, metasave in first NN succeeded but failed in second NN, StackTrace showing NPE is as follows:
> {code}
> 2018-07-12 21:39:09,783 WARN org.apache.hadoop.ipc.Server: IPC Server handler 24 on 8020, call Call#1 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.metaSave from 172.26.9.163:60234
> java.lang.NullPointerException
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseSourceDatanodes(BlockManager.java:2175)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.dumpBlockMeta(BlockManager.java:830)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.metaSave(BlockManager.java:762)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.metaSave(FSNamesystem.java:1782)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.metaSave(FSNamesystem.java:1766)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.metaSave(NameNodeRpcServer.java:1320)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.metaSave(ClientNamenodeProtocolServerSideTranslatorPB.java:928)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1685)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
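
For illustration only, the null check discussed in the comment above could look roughly like the sketch below. This is not the actual BlockManager code: {{MetaSaveSketch}}, {{describe}} and {{dumpPostponed}} are hypothetical names, and a plain {{Map}} stands in for the NameNode's stored-block lookup. It only shows the idea that a block still listed in a postponed queue may no longer have a stored entry, so the dump path should report that rather than dereference null.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the metasave dump path; not the real Hadoop API.
class MetaSaveSketch {

    // storedBlocks: blockId -> description (stand-in for a stored BlockInfo).
    static String describe(Map<Long, String> storedBlocks, long blockId) {
        String info = storedBlocks.get(blockId);
        if (info == null) {
            // The block is queued but no longer stored (for example, deleted
            // in the meantime). Report it instead of dereferencing null.
            return "Block " + blockId + ": no stored BlockInfo";
        }
        return "Block " + blockId + ": " + info;
    }

    // Produces one line per postponed block, tolerating missing entries.
    static List<String> dumpPostponed(Map<Long, String> storedBlocks,
                                      List<Long> postponed) {
        List<String> lines = new ArrayList<>();
        for (long blockId : postponed) {
            lines.add(describe(storedBlocks, blockId));
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<Long, String> stored = new HashMap<>();
        stored.put(1L, "replicas=3");
        // Block 2 is in the postponed queue but has no stored entry.
        for (String line : dumpPostponed(stored, Arrays.asList(1L, 2L))) {
            System.out.println(line);
        }
    }
}
```

Whether the real fix should skip such blocks, log them, or make metasave active-only as suggested is a separate decision; the sketch only covers the guard itself.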