[ 
https://issues.apache.org/jira/browse/HDFS-3936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456139#comment-13456139
 ] 

Eli Collins commented on HDFS-3936:
-----------------------------------

@Colin, the exception here is not unexpected, so asserting on IE here would 
mean shutdown fails.

@Todd, BM#updatedNeededReplications is the only place this patch swallows the 
IOE, think it should be propagated to all callers? The IE comes from the 
interrupt in BM#close, which subsequently swallows the IE, so it seemed 
equivalent. I could add Thread.currentThread().interrupt() so we throw an IE 
again, but that will just get swallowed, right?
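
To make the re-interrupt question concrete, here is a small standalone sketch of what restoring the flag does. The class and method names are made up for illustration and are not from the patch; it only shows that after Thread.currentThread().interrupt() the next blocking call throws an IE again, which some caller then has to handle or swallow.

```java
class InterruptRestoreSketch {
    // Hypothetical stand-in for a method that catches the IE but restores the flag.
    static void swallowAndRestore() {
        try {
            Thread.sleep(10_000);
        } catch (InterruptedException e) {
            // The interrupt status is cleared when the IE is thrown, so set it again.
            Thread.currentThread().interrupt();
        }
    }

    // Returns true if the blocking call made after swallowAndRestore() throws an IE.
    static boolean nextBlockingCallThrows() {
        Thread.currentThread().interrupt(); // simulate the interrupt sent by close()
        swallowAndRestore();
        try {
            Thread.sleep(1);
            return false;
        } catch (InterruptedException e) {
            return true; // the restored flag caused a second IE
        }
    }
}
```

Without the interrupt() call in the catch block the second sleep would simply complete, so the restore only matters if something further up actually checks for the interrupt.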

The top-level RPC methods and test util methods turn the IE into an IOE, think 
those should be preserved as IE as well? IIUC the RPC code will marshal it into 
an IOE anyway.
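
For reference, one common way to turn an IE into an IOE without losing the interrupt is sketched below. This is an assumption about the general pattern, not a copy of what the RPC or test util methods do; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InterruptedIOException;

class InterruptToIoeSketch {
    // Hypothetical top-level method that reports an interrupt as an IOException.
    static void sleepOrIoe(long millis) throws IOException {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            // Keep the interrupt visible to the caller even though the type changes.
            Thread.currentThread().interrupt();
            InterruptedIOException ioe =
                new InterruptedIOException("interrupted while waiting");
            ioe.initCause(ie);
            throw ioe;
        }
    }
}
```

Since InterruptedIOException is a subclass of IOException, existing callers that catch IOE keep working, and callers that care can still tell an interrupt apart.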

While looping TestDFSClientRetries I found a related issue. Interrupting out of 
the RM lock fixes the issue where the BM does not actually exit and races 
with the replication monitor (since it now gets interrupted), but client RPCs 
can still race with NN shutdown. After fixing this, TestDFSClientRetries 
eventually fails with:

{noformat}
  Exception 0: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RemoteException): java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.getBlockCollection(BlocksMap.java:101)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.getBlockCollection(BlockManager.java:2947)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.isValidBlock(FSNamesystem.java:4477)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.allocateBlock(FSNamesystem.java:2460)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2221)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:476)
{noformat}

The NN should really stop the RPC server and drain all RPCs before shutting 
down the FSN, BM etc. I'm thinking that should be punted to another change. With 
the following, this test passes when looped for 10 hours because it only 
races on NN#addBlock.

{code}
+  private boolean isClosed() {
+    return blocks == null;
+  }
+
   BlockCollection getBlockCollection(Block b) {
+    if (isClosed()) {
+      return null; // This call raced with close
+    }
{code}
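
For the longer-term ordering mentioned above (stop accepting RPCs, drain the in-flight ones, then shut down the FSN/BM), a rough sketch of the idea follows. It uses a plain ExecutorService as a stand-in for the RPC handler pool and a flag as a stand-in for the FSN/BM state; none of these names come from the HDFS code, and the real change would look different.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class DrainThenCloseSketch {
    // Stand-in for the RPC handler threads.
    private final ExecutorService handlers = Executors.newFixedThreadPool(4);
    // Stand-in for "FSN/BM has been closed".
    private final AtomicBoolean stateClosed = new AtomicBoolean(false);
    // Set if any call observed the closed state, i.e. raced with shutdown.
    private final AtomicBoolean sawClosedState = new AtomicBoolean(false);

    // Hypothetical RPC: does some work, then touches the shared state.
    void submitCall(long workMillis) {
        handlers.execute(() -> {
            try {
                Thread.sleep(workMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            if (stateClosed.get()) {
                sawClosedState.set(true);
            }
        });
    }

    // Stop accepting calls, wait for in-flight calls, and only then close the state.
    // Returns false if the drain timed out (the state is closed anyway in this sketch).
    boolean shutdown(long timeoutMillis) throws InterruptedException {
        handlers.shutdown();
        boolean drained = handlers.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        stateClosed.set(true);
        return drained;
    }

    boolean anyCallSawClosedState() {
        return sawClosedState.get();
    }
}
```

With this ordering the isClosed() check would only be needed when the drain times out, since no handler can be running once awaitTermination returns true.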
                
> MiniDFSCluster shutdown may fail due to BlocksMap#getBlockCollection NPE
> ------------------------------------------------------------------------
>
>                 Key: HDFS-3936
>                 URL: https://issues.apache.org/jira/browse/HDFS-3936
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.0.0-alpha
>            Reporter: Eli Collins
>            Assignee: Eli Collins
>         Attachments: hdfs-3936.txt
>
>
> Looks like HDFS-3664 didn't fix the whole issue: the added join times 
> out because the thread closing the BM (FSN#stopCommonServices) holds the FSN 
> lock while closing the BM, and the BM is blocked uninterruptibly trying to 
> acquire the FSN lock.
> {noformat}
> 2012-09-13 18:54:12,526 FATAL hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdown(1355)) - Test resulted in an unexpected exit
> org.apache.hadoop.util.ExitUtil$ExitException: Fatal exception with message null
> stack trace
> java.lang.NullPointerException
>       at org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.getBlockCollection(BlocksMap.java:101)
>       at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1132)
>       at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1107)
>       at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3061)
>       at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3023)
>       at java.lang.Thread.run(Thread.java:662)
> {noformat}

