Splitting log in a hostile environment -- bad hdfs -- we drop logs
------------------------------------------------------------------
Key: HBASE-642
URL: https://issues.apache.org/jira/browse/HBASE-642
Project: Hadoop HBase
Issue Type: Bug
Reporter: stack
Fix For: 0.1.3, 0.2.0
The master has noticed that the regionserver that was carrying the .META.
region among others has died and it goes to split its logs:
{code}
2008-05-18 19:58:01,292 DEBUG org.apache.hadoop.hbase.HLog: Splitting 0 of 2:
hdfs://domU-12-31-38-00-D4-21:9000/hbase/log_10.254.30.79_1210899434766_60020/hlog.dat.017
2008-05-18 19:58:01,408 DEBUG org.apache.hadoop.hbase.HLog: Creating new log
file writer for path
hdfs://domU-12-31-38-00-D4-21:9000/hbase/categories/1060231198/oldlogfile.log
and region categories,2864153,1211005494348
2008-05-18 19:58:01,573 DEBUG org.apache.hadoop.hbase.HLog: Creating new log
file writer for path
hdfs://domU-12-31-38-00-D4-21:9000/hbase/categories/297165731/oldlogfile.log
and region categories,5992242,1211005494349
{code}
Master can't write hdfs for some reason so can't do the log split:
{code}
2008-05-18 19:59:15,265 INFO org.apache.hadoop.dfs.DFSClient: Exception in
createBlockOutputStream java.net.SocketTimeoutException: Read timed out
2008-05-18 19:59:15,266 INFO org.apache.hadoop.dfs.DFSClient: Abandoning block
blk_7852777250062244002
2008-05-18 19:59:15,268 INFO org.apache.hadoop.dfs.DFSClient: Waiting to find
target node: 10.252.219.207:50010
2008-05-18 19:59:39,945 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: /10.254.30.79:60020. Already tried 6 time(s).
2008-05-18 20:00:21,274 INFO org.apache.hadoop.dfs.DFSClient: Exception in
createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink
10.254.30.79:50010
2008-05-18 20:00:21,275 INFO org.apache.hadoop.dfs.DFSClient: Abandoning block
blk_7007215478628265924
2008-05-18 20:00:21,277 INFO org.apache.hadoop.dfs.DFSClient: Waiting to find
target node: 10.252.219.207:50010
2008-05-18 20:00:40,955 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: /10.254.30.79:60020. Already tried 7 time(s).
2008-05-18 20:01:31,178 INFO org.apache.hadoop.dfs.DFSClient: Exception in
createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink
10.254.30.79:50010
2008-05-18 20:01:31,178 INFO org.apache.hadoop.dfs.DFSClient: Abandoning block
blk_2374125514769088471
2008-05-18 20:01:31,180 INFO org.apache.hadoop.dfs.DFSClient: Waiting to find
target node: 10.252.219.207:50010
2008-05-18 20:01:40,145 INFO org.apache.hadoop.dfs.DFSClient: Exception in
createBlockOutputStream java.net.SocketTimeoutException: Read timed out
2008-05-18 20:01:40,145 INFO org.apache.hadoop.dfs.DFSClient: Abandoning block
blk_-621042589816139684
2008-05-18 20:01:40,148 INFO org.apache.hadoop.dfs.DFSClient: Waiting to find
target node: 10.252.219.207:50010
..
{code}
Weirdly, the above is complaining can't connect to the datanode running on same
host as master.
Eventually the split fails with:
{code}
2008-05-18 20:24:28,393 WARN org.apache.hadoop.hbase.HMaster: Processing
pending operations: ProcessServerShutdown of 10.254.30.79:60020
java.io.IOException: java.io.IOException: Could not complete write to file
/hbase/categories/1060231198/oldlogfile.log by DFSClient_520078809
at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:343)
at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:409)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:901)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at
org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:82)
at org.apache.hadoop.hbase.HMaster.run(HMaster.java:1116)
2008-05-18 20:24:28,394 DEBUG org.apache.hadoop.hbase.HMaster: Main processing
loop: ProcessServerShutdown of 10.254.30.79:60020
2008-05-18 20:24:28,394 INFO org.apache.hadoop.hbase.HMaster: process shutdown
of server 10.254.30.79:60020: logSplit: false, rootRescanned: false,
numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
2008-05-18 20:24:28,395 INFO org.apache.hadoop.hbase.HLog: splitting 1 log(s)
in hdfs://domU-12-31-38-00-D4-21:9000/hbase/log_10.254.30.79_1210899434766_60020
2008-05-18 20:24:28,395 DEBUG org.apache.hadoop.hbase.HLog: Splitting 0 of 1:
hdfs://domU-12-31-38-00-D4-21:9000/hbase/log_10.254.30.79_1210899434766_60020/hlog.dat.018
2008-05-18 20:24:28,399 WARN org.apache.hadoop.hbase.HLog: Old log file
hdfs://domU-12-31-38-00-D4-21:9000/hbase/categories/297165731/oldlogfile.log
already exists. Copying existing file to new file
2008-05-18 20:25:03,105 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: /10.254.30.79:60020. Already tried 9 time(s).
2008-05-18 20:25:28,401 WARN org.apache.hadoop.hbase.HLog: Exception processing
hdfs://domU-12-31-38-00-D4-21:9000/hbase/log_10.254.30.79_1210899434766_60020/hlog.dat.018
-- continuing. Possible DATA LOSS!
java.net.SocketTimeoutException: timed out waiting for rpc response
...
{code}
And we just move on to the next log -- we notice the earlier attempt at
distributing the edits and try to pick them up (though in this case, the file
is likely empty) -- but the split of new log also fails.
Though log says 'Possible DATA LOSS!', we keep going . The .META. and other
regions are reassigned and deployed though they are likely missing edits.
In this catastrophic case, I'd say master should not move and if it has to, go
down rather than reassign regions and try to keep going.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.