[ https://issues.apache.org/jira/browse/HADOOP-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12529188 ]
stack commented on HADOOP-1924:
-------------------------------

Here are notes from examining a couple of stack traces produced last night by Jim in patch build #798, before application of HADOOP-1923. In short, we are hung in a flush/write down in the RPC Client call, inside a synchronize on the output stream. It's never going to return. There is no server side to this blocked write in the stack trace; the DFS has been shut down. When the server side went away, we should have gotten a connection-aborted exception. Since there is no timeout on an established socket write, this flush/write is just going to sit there; it's never going to complete, and because it is inside a synchronize block, all regionserver threads are blocked from continuing/exiting. Here is the blocked thread:

{code}
    [junit] "regionserver/0.0.0.0:62532.splitOrCompactChecker" daemon prio=10 tid=0x089bc6b8 nid=0x67 runnable [0xe28b3000..0xe28b3db8]
    [junit]     at java.net.SocketOutputStream.socketWrite0(Native Method)
    [junit]     at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
    [junit]     at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
    [junit]     at org.apache.hadoop.ipc.Client$Connection$2.write(Client.java:190)
    [junit]     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    [junit]     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
    [junit]     - locked <0xe7f640a8> (a java.io.BufferedOutputStream)
    [junit]     at java.io.DataOutputStream.flush(DataOutputStream.java:106)
    [junit]     at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:325)
    [junit]     - locked <0xe7f63f78> (a java.io.DataOutputStream)
    [junit]     at org.apache.hadoop.ipc.Client.call(Client.java:462)
    [junit]     - locked <0xf66a4488> (a org.apache.hadoop.ipc.Client$Call)
    [junit]     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:165)
    [junit]     at org.apache.hadoop.dfs.$Proxy0.getFileInfo(Unknown Source)
    [junit]     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    [junit]     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    [junit]     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    [junit]     at java.lang.reflect.Method.invoke(Method.java:585)
    [junit]     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    [junit]     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    [junit]     at org.apache.hadoop.dfs.$Proxy0.getFileInfo(Unknown Source)
    [junit]     at org.apache.hadoop.dfs.DFSClient.getFileInfo(DFSClient.java:432)
    [junit]     at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:337)
    [junit]     at org.apache.hadoop.hbase.HStoreFile.length(HStoreFile.java:971)
    [junit]     at org.apache.hadoop.hbase.HStore.size(HStore.java:1322)
    [junit]     at org.apache.hadoop.hbase.HRegion.largestHStore(HRegion.java:624)
    [junit]     at org.apache.hadoop.hbase.HRegion.needsSplit(HRegion.java:584)
    [junit]     at org.apache.hadoop.hbase.HRegionServer$SplitOrCompactChecker.checkForSplitsOrCompactions(HRegionServer.java:204)
    [junit]     at org.apache.hadoop.hbase.HRegionServer$SplitOrCompactChecker.chore(HRegionServer.java:189)
    [junit]     - locked <0xe7f74028> (a java.lang.Integer)
    [junit]     at org.apache.hadoop.hbase.Chore.run(Chore.java:59)
{code}
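To make the failure mode concrete, here is a minimal, self-contained sketch -- not Hadoop code; the class, thread names, and buffer size are made up for illustration -- of a blocking socket write inside a synchronized method once the peer stops reading. There is no timeout on an established socket write, so the writer never returns, and every other caller of the synchronized method parks on the monitor, which is the same picture as the regionserver dump above:

{code}
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class StuckWriterSketch {
    private final DataOutputStream out;

    StuckWriterSketch(Socket socket) throws IOException {
        this.out = new DataOutputStream(socket.getOutputStream());
    }

    // Analogue of Client$Connection.sendParam(): the write happens with the lock held.
    synchronized void sendParam(byte[] payload) throws IOException {
        out.write(payload);
        out.flush();   // blocks indefinitely once the peer stops draining the socket
    }

    public static void main(String[] args) throws Exception {
        // "Server" that accepts a connection but never reads from it, playing the
        // role of a DFS that has been shut down underneath the client.
        ServerSocket server = new ServerSocket(0);
        Socket client = new Socket("localhost", server.getLocalPort());
        server.accept();   // connection established, but nothing will ever read it

        StuckWriterSketch conn = new StuckWriterSketch(client);
        byte[] chunk = new byte[64 * 1024];

        Thread writer = new Thread(() -> {
            try {
                // Writes until the kernel send buffers fill, then blocks in the
                // native socket write with the monitor held; no write timeout applies.
                while (true) {
                    conn.sendParam(chunk);
                }
            } catch (IOException ignored) {
            }
        }, "writer");
        writer.start();

        Thread blocked = new Thread(() -> {
            try {
                conn.sendParam(chunk);   // parks on the monitor, never gets in
            } catch (IOException ignored) {
            }
        }, "blocked-caller");
        blocked.start();

        Thread.sleep(5000);
        System.out.println("writer:         " + writer.getState());
        System.out.println("blocked-caller: " + blocked.getState());
        System.exit(0);
    }
}
{code}

After a few seconds this should print the writer as RUNNABLE (stuck in the native write) and the second thread as BLOCKED on the monitor, mirroring the thread dump.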
The main changes in the applied patch for HADOOP-1923 are a change in how the health of HDFS is checked -- an exists check on the hbase root dir rather than a request for datanode info -- and removal of a synchronize around the call to the filesystem health check. I'm unable to explain why these changes would make the hang go away. The filesystem health check is just a lookup of namenode in-memory data structures, whether it's a get of datanode state or a check that a directory exists. Removing the synchronization on the filesystem check should just mean that instead of threads parking at the entrance to FsUtils.isFileSystemAvailable, their park spot moves down the stack to org.apache.hadoop.ipc.Client.call, waiting for sendParam's lock on the output stream to free up (other threads in the stack trace are already parked in this spot). A rough sketch of the exists-style check follows, after the quoted issue description below.

> [hbase] TestDFSAbort failed in nightly #242
> -------------------------------------------
>
>                 Key: HADOOP-1924
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1924
>             Project: Hadoop
>          Issue Type: Bug
>          Components: contrib/hbase
>            Reporter: stack
>            Priority: Minor
>         Attachments: testdfsabort.patch
>
>
> TestDFSAbort and TestBloomFilters failed in last night's nightly build (#242).
> This issue is about trying to figure out what's up w/ TDFSA.
> Studying console logs, HRegionServer stopped logging any activity and HMaster
> for its part did not expire the HRegionServer lease. On top of it all,
> continued tests of the state of HDFS -- the test is meant to make sure HBase
> shuts down when HDFS is pulled from under it -- seem to have continued
> reporting it healthy even though it had been closed.
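For reference, a rough sketch of the exists-style health check described above. This is not the HADOOP-1923 patch itself; the class name, method name, and the "/hbase" path are placeholders, and a plain FileSystem.exists() call stands in for whatever lookup the real check performs:

{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsHealthCheckSketch {

    /**
     * Returns true if the filesystem looks available. The check is a single
     * namenode lookup (does the hbase root dir exist?) rather than a request
     * for datanode state; any failure to reach the namenode is treated as
     * "filesystem gone".
     */
    static boolean isFileSystemAvailable(FileSystem fs, Path hbaseRootDir) {
        try {
            return fs.exists(hbaseRootDir);
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        System.out.println("fs available: "
            + isFileSystemAvailable(fs, new Path("/hbase")));
    }
}
{code}

Either style of check bottoms out in the same RPC client connection, so if that connection is already stuck in a blocked write, the check blocks with it; removing the synchronized wrapper only changes where the callers park.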