Splitting log in a hostile environment -- bad hdfs -- we drop logs
------------------------------------------------------------------

                 Key: HBASE-642
                 URL: https://issues.apache.org/jira/browse/HBASE-642
             Project: Hadoop HBase
          Issue Type: Bug
            Reporter: stack
             Fix For: 0.1.3, 0.2.0


The master has noticed that the regionserver that was carrying the .META. 
region among others has died and it goes to split its logs:

{code}
2008-05-18 19:58:01,292 DEBUG org.apache.hadoop.hbase.HLog: Splitting 0 of 2: 
hdfs://domU-12-31-38-00-D4-21:9000/hbase/log_10.254.30.79_1210899434766_60020/hlog.dat.017
2008-05-18 19:58:01,408 DEBUG org.apache.hadoop.hbase.HLog: Creating new log 
file writer for path 
hdfs://domU-12-31-38-00-D4-21:9000/hbase/categories/1060231198/oldlogfile.log 
and region categories,2864153,1211005494348
2008-05-18 19:58:01,573 DEBUG org.apache.hadoop.hbase.HLog: Creating new log 
file writer for path 
hdfs://domU-12-31-38-00-D4-21:9000/hbase/categories/297165731/oldlogfile.log 
and region categories,5992242,1211005494349
{code}

Master can't write hdfs for some reason so can't do the log split:

{code}
2008-05-18 19:59:15,265 INFO org.apache.hadoop.dfs.DFSClient: Exception in 
createBlockOutputStream java.net.SocketTimeoutException: Read timed out
2008-05-18 19:59:15,266 INFO org.apache.hadoop.dfs.DFSClient: Abandoning block 
blk_7852777250062244002
2008-05-18 19:59:15,268 INFO org.apache.hadoop.dfs.DFSClient: Waiting to find 
target node: 10.252.219.207:50010
2008-05-18 19:59:39,945 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: /10.254.30.79:60020. Already tried 6 time(s).
2008-05-18 20:00:21,274 INFO org.apache.hadoop.dfs.DFSClient: Exception in 
createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 
10.254.30.79:50010
2008-05-18 20:00:21,275 INFO org.apache.hadoop.dfs.DFSClient: Abandoning block 
blk_7007215478628265924
2008-05-18 20:00:21,277 INFO org.apache.hadoop.dfs.DFSClient: Waiting to find 
target node: 10.252.219.207:50010
2008-05-18 20:00:40,955 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: /10.254.30.79:60020. Already tried 7 time(s).
2008-05-18 20:01:31,178 INFO org.apache.hadoop.dfs.DFSClient: Exception in 
createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 
10.254.30.79:50010
2008-05-18 20:01:31,178 INFO org.apache.hadoop.dfs.DFSClient: Abandoning block 
blk_2374125514769088471
2008-05-18 20:01:31,180 INFO org.apache.hadoop.dfs.DFSClient: Waiting to find 
target node: 10.252.219.207:50010
2008-05-18 20:01:40,145 INFO org.apache.hadoop.dfs.DFSClient: Exception in 
createBlockOutputStream java.net.SocketTimeoutException: Read timed out
2008-05-18 20:01:40,145 INFO org.apache.hadoop.dfs.DFSClient: Abandoning block 
blk_-621042589816139684
2008-05-18 20:01:40,148 INFO org.apache.hadoop.dfs.DFSClient: Waiting to find 
target node: 10.252.219.207:50010
..
{code}

Weirdly, the above is complaining can't connect to the datanode running on same 
host as master.

Eventually the split fails with:

{code}
2008-05-18 20:24:28,393 WARN org.apache.hadoop.hbase.HMaster: Processing 
pending operations: ProcessServerShutdown of 10.254.30.79:60020
java.io.IOException: java.io.IOException: Could not complete write to file 
/hbase/categories/1060231198/oldlogfile.log by DFSClient_520078809
    at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:343)
    at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:409)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:901)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at 
org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:82)
    at org.apache.hadoop.hbase.HMaster.run(HMaster.java:1116)
2008-05-18 20:24:28,394 DEBUG org.apache.hadoop.hbase.HMaster: Main processing 
loop: ProcessServerShutdown of 10.254.30.79:60020
2008-05-18 20:24:28,394 INFO org.apache.hadoop.hbase.HMaster: process shutdown 
of server 10.254.30.79:60020: logSplit: false, rootRescanned: false, 
numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
2008-05-18 20:24:28,395 INFO org.apache.hadoop.hbase.HLog: splitting 1 log(s) 
in hdfs://domU-12-31-38-00-D4-21:9000/hbase/log_10.254.30.79_1210899434766_60020
2008-05-18 20:24:28,395 DEBUG org.apache.hadoop.hbase.HLog: Splitting 0 of 1: 
hdfs://domU-12-31-38-00-D4-21:9000/hbase/log_10.254.30.79_1210899434766_60020/hlog.dat.018
2008-05-18 20:24:28,399 WARN org.apache.hadoop.hbase.HLog: Old log file 
hdfs://domU-12-31-38-00-D4-21:9000/hbase/categories/297165731/oldlogfile.log 
already exists. Copying existing file to new file
2008-05-18 20:25:03,105 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: /10.254.30.79:60020. Already tried 9 time(s).
2008-05-18 20:25:28,401 WARN org.apache.hadoop.hbase.HLog: Exception processing 
hdfs://domU-12-31-38-00-D4-21:9000/hbase/log_10.254.30.79_1210899434766_60020/hlog.dat.018
 -- continuing. Possible DATA LOSS!
java.net.SocketTimeoutException: timed out waiting for rpc response
...
{code}

And we just move on to the next log -- we notice the earlier attempt at 
distributing the edits and try to pick them up (though in this case, the file 
is likely empty) -- but the split of new log also fails.

Though log says 'Possible DATA LOSS!', we keep going . The .META. and other 
regions are reassigned and deployed though they are likely missing edits.

In this catastrophic case, I'd say master should not move and if it has to, go 
down rather than reassign regions and try to keep going.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to