[ https://issues.apache.org/jira/browse/HADOOP-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12529188 ]

stack commented on HADOOP-1924:
-------------------------------

Here are notes from an examination of a couple of stack traces produced last night by 
Jim in patch build #798, before application of HADOOP-1923.

In short, we are hung in a flush/write down in the RPC Client.call, inside a 
synchronize on the output stream.  It's never going to return: there is no 
server side to this blocked write in the stack trace because the DFS has been 
shut down.  When the server side went away we should have gotten a connection 
aborted exception, but since there is no timeout on a write to an established 
socket, this flush/write is just going to sit there.  It's never going to 
complete, and because it's inside a synchronized block, all regionserver 
threads are blocked from continuing/exiting.

Here is the blocked thread:

{code}
    [junit] "regionserver/0.0.0.0:62532.splitOrCompactChecker" daemon prio=10 
tid=0x089bc6b8 nid=0x67 runnable [0xe28b3000..0xe28b3db8]
    [junit]     at java.net.SocketOutputStream.socketWrite0(Native Method)
    [junit]     at 
java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
    [junit]     at 
java.net.SocketOutputStream.write(SocketOutputStream.java:136)
    [junit]     at 
org.apache.hadoop.ipc.Client$Connection$2.write(Client.java:190)
    [junit]     at 
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    [junit]     at 
java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
    [junit]     - locked <0xe7f640a8> (a java.io.BufferedOutputStream)
    [junit]     at java.io.DataOutputStream.flush(DataOutputStream.java:106)
    [junit]     at 
org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:325)
    [junit]     - locked <0xe7f63f78> (a java.io.DataOutputStream)
    [junit]     at org.apache.hadoop.ipc.Client.call(Client.java:462)
    [junit]     - locked <0xf66a4488> (a org.apache.hadoop.ipc.Client$Call)
    [junit]     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:165)
    [junit]     at org.apache.hadoop.dfs.$Proxy0.getFileInfo(Unknown Source)
    [junit]     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    [junit]     at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    [junit]     at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    [junit]     at java.lang.reflect.Method.invoke(Method.java:585)
    [junit]     at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    [junit]     at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    [junit]     at org.apache.hadoop.dfs.$Proxy0.getFileInfo(Unknown Source)
    [junit]     at 
org.apache.hadoop.dfs.DFSClient.getFileInfo(DFSClient.java:432)
    [junit]     at 
org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:337)
    [junit]     at 
org.apache.hadoop.hbase.HStoreFile.length(HStoreFile.java:971)
    [junit]     at org.apache.hadoop.hbase.HStore.size(HStore.java:1322)
    [junit]     at 
org.apache.hadoop.hbase.HRegion.largestHStore(HRegion.java:624)
    [junit]     at org.apache.hadoop.hbase.HRegion.needsSplit(HRegion.java:584)
    [junit]     at 
org.apache.hadoop.hbase.HRegionServer$SplitOrCompactChecker.checkForSplitsOrCompactions(HRegionServer.java:204)
    [junit]     at 
org.apache.hadoop.hbase.HRegionServer$SplitOrCompactChecker.chore(HRegionServer.java:189)
    [junit]     - locked <0xe7f74028> (a java.lang.Integer)
    [junit]     at org.apache.hadoop.hbase.Chore.run(Chore.java:59)
{code}
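
To make the failure mode concrete, here is a minimal sketch (not the actual ipc.Client code; the class below is hypothetical) of the pattern the trace shows: the flush happens while holding the output stream's monitor, and a write on an established socket has no timeout, so once the server side is gone the flush can park forever and every other RPC caller queues behind it.

{code}
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.Socket;

// Hypothetical sketch of the pattern in the trace above: the flush in
// Client$Connection.sendParam runs while holding the stream's monitor,
// and a blocking socket write has no timeout.
class BlockedWriteSketch {
  private final DataOutputStream out;

  BlockedWriteSketch(Socket socket) throws IOException {
    // Socket.setSoTimeout only bounds reads; there is no JDK option
    // that bounds a blocking write on an established connection.
    this.out = new DataOutputStream(
        new BufferedOutputStream(socket.getOutputStream()));
  }

  void sendParam(byte[] param) throws IOException {
    synchronized (out) {   // every RPC caller serializes here
      out.writeInt(param.length);
      out.write(param);
      out.flush();         // if the server is gone and the TCP send
                           // buffer is full, this never returns, so
                           // the monitor on 'out' is never released
    }
  }
}
{code}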

The main changes in the applied HADOOP-1923 patch are a change in how the 
health of HDFS is checked -- an exists check on the hbase root dir rather than 
a request for datanode info -- and the removal of a synchronize around the call 
that checks filesystem health.  I'm unable to explain why these changes would 
make the hang go away.

The filesystem health check is just a lookup of namenode in-memory data 
structures, whether it's a get of datanode state or a check that a directory 
exists.
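
For illustration, the two flavours of check might look roughly like the sketch below; the method and variable names here are assumptions rather than the actual HADOOP-1923 code.  Either way it is a single RPC that the namenode answers out of memory.

{code}
import java.io.IOException;
import org.apache.hadoop.dfs.DistributedFileSystem;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hedged sketch of the two health-check flavours; the real
// HADOOP-1923 code may differ in detail.
class FsHealthCheckSketch {

  // Old flavour: ask the namenode for datanode state (assumes a
  // getDataNodeStats() call on the era's DistributedFileSystem).
  static boolean viaDatanodeInfo(FileSystem fs) {
    if (!(fs instanceof DistributedFileSystem)) {
      return true; // e.g. local filesystem in tests
    }
    try {
      return ((DistributedFileSystem) fs).getDataNodeStats().length > 0;
    } catch (IOException e) {
      return false;
    }
  }

  // New flavour: an exists() probe on the hbase root directory.
  static boolean viaRootDirExists(FileSystem fs, Path hbaseRootDir) {
    try {
      return fs.exists(hbaseRootDir);
    } catch (IOException e) {
      return false;
    }
  }
}
{code}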

The removal of the synchronization around the filesystem check should just mean 
that instead of threads parking at the entrance to FsUtils.isFileSystemAvailable, 
their park spot moves down the stack into org.apache.hadoop.ipc.Client.call, 
waiting for sendParam's lock on the output stream to free up (other threads in 
the stack trace are already parked in this spot).
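
In other words, the change only relocates where the threads block.  A hedged before/after sketch (the actual HRegionServer code in HADOOP-1923 likely differs):

{code}
// Hedged sketch of the synchronization change; names are illustrative.
class FsCheckSyncSketch {
  private final Object fsCheckLock = new Object();

  // Before: callers queue on fsCheckLock, so a thread dump shows them
  // parked at the entrance to the filesystem health check.
  void checkFileSystemWithLock() {
    synchronized (fsCheckLock) {
      isFileSystemAvailable();
    }
  }

  // After: no outer lock, so every caller proceeds into the RPC and
  // parks deeper instead, on the DataOutputStream monitor inside
  // org.apache.hadoop.ipc.Client.call.
  void checkFileSystemWithoutLock() {
    isFileSystemAvailable();
  }

  private boolean isFileSystemAvailable() {
    // Stand-in for the FsUtils.isFileSystemAvailable call: ultimately
    // one RPC to the namenode that can block as shown in the trace.
    return true;
  }
}
{code}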

> [hbase] TestDFSAbort failed in nightly #242
> -------------------------------------------
>
>                 Key: HADOOP-1924
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1924
>             Project: Hadoop
>          Issue Type: Bug
>          Components: contrib/hbase
>            Reporter: stack
>            Priority: Minor
>         Attachments: testdfsabort.patch
>
>
> TestDFSAbort and TestBloomFilters failed in last night's nightly build (#242). 
>  This issue is about trying to figure out what's up w/ TestDFSAbort.
> Studying console logs, HRegionServer stopped logging any activity and HMaster 
> for its part did not expire the HRegionServer lease.  On top of it all, 
> continued tests of the state of HDFS -- the test is meant to ensure HBase 
> shuts down when HDFS is pulled out from under it -- seem to have kept 
> reporting it healthy even though it had been closed.

