[ 
https://issues.apache.org/jira/browse/HBASE-846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12625909#action_12625909
 ] 

Andrew Purtell commented on HBASE-846:
--------------------------------------

I saw a region server go down due to a DFS replication failure last week but 
have not had time to investigate further. It occurred during a test at approx 
20000 transactions per second on a 6 node cluster (5 datanode/regionserver 
nodes). I believe the root cause was a DFS collapse; the datanode logs were so 
noisy (was using 0.17.1 at the time) there appeared to be only hay in the 
haystack. Certainly, though, the DFS volume was not full: utilization was at 
about 0.5% on every datanode.

If the regionserver were able to survive this, the drop in transaction rate 
caused by such a hiccup would presumably allow DFS to recover, and the cluster 
would eventually right itself. Instead the cluster enters a death spiral with 
one fewer region server at every step.

> hbase loses its mind when hdfs fills
> ------------------------------------
>
>                 Key: HBASE-846
>                 URL: https://issues.apache.org/jira/browse/HBASE-846
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: stack
>
> Looking in log, I see:
> {code}
> 2008-08-26 18:57:23,602 INFO org.apache.hadoop.dfs.DFSClient: 
> org.apache.hadoop.ipc.RemoteException: java.io.IOException: File 
> /hbase/aa0-000-8.u.powerset.com/log_208.76.45.95_1218666613846_60020/hlog.dat.1219776799293
>  could only be replicated to 0 nodes, instead of 1
>         at 
> org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1145)
>         at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:300)
>         at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>         at java.lang.reflect.Method.invoke(Unknown Source)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
>         at org.apache.hadoop.ipc.Client.call(Client.java:557)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
>         at org.apache.hadoop.dfs.$Proxy1.addBlock(Unknown Source)
>         at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>         at java.lang.reflect.Method.invoke(Unknown Source)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>         at org.apache.hadoop.dfs.$Proxy1.addBlock(Unknown Source)
>         at 
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2335)
>         at 
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2220)
>         at 
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1700(DFSClient.java:1702)
>         at 
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1842)
> {code}
> ... and then:
> {code}
> 2008-08-26 18:57:28,423 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery 
> for block null bad datanode[0]
> 2008-08-26 18:57:28,424 FATAL org.apache.hadoop.hbase.regionserver.HLog: 
> Could not append. Requesting close of log
> java.io.IOException: Could not get block locations. Aborting... 
>         at 
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2081)
>         at 
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1300(DFSClient.java:1702)
>         at 
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1818)
> 2008-08-26 18:57:28,424 INFO org.apache.hadoop.hbase.regionserver.LogRoller: 
> Rolling hlog. Number of entries: 127
> 2008-08-26 18:57:28,424 ERROR org.apache.hadoop.hbase.regionserver.LogRoller: 
> Log rolling failed
> java.io.IOException: Could not get block locations. Aborting...
>         at 
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2081)
>         at 
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1300(DFSClient.java:1702)
>         at 
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1818)
> ...
> {code}
> ... and so on.
> Meantime clients are trying to do updates and getting below:
> {code}
> 2008-08-26 22:49:42,834 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 9 on 60020, call batchUpdate([EMAIL PROTECTED], row => 
> IKwQLMJ3rKRvtAv_ZkQlAk==, {column => page:url, value => '...', column => 
> page:contents, value => '...'}) from 208.76.45.3:51164: error: 
> java.io.IOException: Could not get block locations. Aborting... 
> java.io.IOException: Could not get block locations. Aborting...
>         at 
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2081)
>         at 
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1300(DFSClient.java:1702)
>         at 
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1818)
> ..
> {code}
> DFSClient seems horked; currently a restart is needed to recover.
> Need to be able to ride out these kinds of events.
> Test this by filling HDFS.
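The horked state surfaces with recognizable error text in the logs above. One possible mitigation (a heuristic sketch, not a shipped fix; the class and method names are hypothetical) is to classify these wedged-stream errors so the caller can close and recreate the HLog writer instead of retrying on the dead DFSOutputStream:

```java
import java.io.IOException;

// Hypothetical sketch: recognize the errors the 0.17/0.18-era DFSClient
// surfaces once its output stream has aborted. Matching on message text is
// a heuristic, not a documented API; on a true result, a caller would tear
// down and reopen the log writer rather than retry the dead stream.
class WedgedStreamCheck {

    static boolean isWedged(IOException e) {
        String m = e.getMessage();
        return m != null
                && (m.contains("Could not get block locations")
                    || m.contains("could only be replicated to 0 nodes"));
    }
}
```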

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
