[ https://issues.apache.org/jira/browse/HBASE-642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-642.
-------------------------

    Resolution: Fixed
      Assignee: stack

Committed branch and trunk.

Changed the try/catch around the reconstruction log replay so that we only 
catch and swallow EOFExceptions.  On any other IOE, we now report it and 
rethrow, letting the region fail on each open attempt until human 
intervention.  If we get an IOE replaying the reconstruction log, something 
is badly wrong, probably in hdfs.

{code}
Index: src/java/org/apache/hadoop/hbase/HStore.java
===================================================================
--- src/java/org/apache/hadoop/hbase/HStore.java        (revision 660681)
+++ src/java/org/apache/hadoop/hbase/HStore.java        (working copy)
@@ -825,13 +825,21 @@
     
     try {
       doReconstructionLog(reconstructionLog, maxSeqId, reporter);
+    } catch (EOFException e) {
+      // Presume we got here because of lack of HADOOP-1700; for now keep going
+      // but this is probably not what we want long term.  If we got here there
+      // has been data-loss
+      LOG.warn("Exception processing reconstruction log " + reconstructionLog +
+        " opening " + this.storeName +
+        " -- continuing.  Probably lack-of-HADOOP-1700 causing DATA LOSS!", e);
     } catch (IOException e) {
-      // Presume we got here because of some HDFS issue or because of a lack of
-      // HADOOP-1700; for now keep going but this is probably not what we want
-      // long term.  If we got here there has been data-loss
+      // Presume we got here because of some HDFS issue. Don't just keep going.
+      // Fail to open the HStore.  Probably means we'll fail over and over
+      // again until human intervention but alternative has us skipping logs
+      // and losing edits: HBASE-642.
       LOG.warn("Exception processing reconstruction log " + reconstructionLog +
-        " opening " + this.storeName +
-        " -- continuing.  Probably DATA LOSS!", e);
+        " opening " + this.storeName, e);
+      throw e;
     }
{code}
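
For reference, this is how the try/catch reads with the patch applied (just 
the diff above merged into HStore.java, shown as a plain fragment):

{code}
try {
  doReconstructionLog(reconstructionLog, maxSeqId, reporter);
} catch (EOFException e) {
  // Swallow: presume we got here because of lack of HADOOP-1700 truncating
  // the log mid-edit; keep going, but there has been data-loss.
  LOG.warn("Exception processing reconstruction log " + reconstructionLog +
    " opening " + this.storeName +
    " -- continuing.  Probably lack-of-HADOOP-1700 causing DATA LOSS!", e);
} catch (IOException e) {
  // Anything else is presumed an HDFS issue.  Rethrow so the HStore open
  // fails (over and over, until human intervention) rather than skipping
  // logs and losing edits: HBASE-642.
  LOG.warn("Exception processing reconstruction log " + reconstructionLog +
    " opening " + this.storeName, e);
  throw e;
}
{code}

Note the clause ordering matters: EOFException extends IOException, so its 
catch has to come first or it would never be reached.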


> Splitting log in a hostile environment -- bad hdfs -- we drop write-ahead-log edits
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-642
>                 URL: https://issues.apache.org/jira/browse/HBASE-642
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: stack
>             Fix For: 0.1.3, 0.2.0
>
>
> The master has noticed that the regionserver that was carrying the .META. 
> region, among others, has died, and it goes to split its logs:
> {code}
> 2008-05-18 19:58:01,292 DEBUG org.apache.hadoop.hbase.HLog: Splitting 0 of 2: 
> hdfs://domU-12-31-38-00-D4-21:9000/hbase/log_10.254.30.79_1210899434766_60020/hlog.dat.017
> 2008-05-18 19:58:01,408 DEBUG org.apache.hadoop.hbase.HLog: Creating new log 
> file writer for path 
> hdfs://domU-12-31-38-00-D4-21:9000/hbase/categories/1060231198/oldlogfile.log 
> and region categories,2864153,1211005494348
> 2008-05-18 19:58:01,573 DEBUG org.apache.hadoop.hbase.HLog: Creating new log 
> file writer for path 
> hdfs://domU-12-31-38-00-D4-21:9000/hbase/categories/297165731/oldlogfile.log 
> and region categories,5992242,1211005494349
> {code}
> The master can't write to hdfs for some reason, so it can't do the log split:
> {code}
> 2008-05-18 19:59:15,265 INFO org.apache.hadoop.dfs.DFSClient: Exception in 
> createBlockOutputStream java.net.SocketTimeoutException: Read timed out
> 2008-05-18 19:59:15,266 INFO org.apache.hadoop.dfs.DFSClient: Abandoning 
> block blk_7852777250062244002
> 2008-05-18 19:59:15,268 INFO org.apache.hadoop.dfs.DFSClient: Waiting to find 
> target node: 10.252.219.207:50010
> 2008-05-18 19:59:39,945 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: /10.254.30.79:60020. Already tried 6 time(s).
> 2008-05-18 20:00:21,274 INFO org.apache.hadoop.dfs.DFSClient: Exception in 
> createBlockOutputStream java.io.IOException: Bad connect ack with 
> firstBadLink 10.254.30.79:50010
> 2008-05-18 20:00:21,275 INFO org.apache.hadoop.dfs.DFSClient: Abandoning 
> block blk_7007215478628265924
> 2008-05-18 20:00:21,277 INFO org.apache.hadoop.dfs.DFSClient: Waiting to find 
> target node: 10.252.219.207:50010
> 2008-05-18 20:00:40,955 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: /10.254.30.79:60020. Already tried 7 time(s).
> 2008-05-18 20:01:31,178 INFO org.apache.hadoop.dfs.DFSClient: Exception in 
> createBlockOutputStream java.io.IOException: Bad connect ack with 
> firstBadLink 10.254.30.79:50010
> 2008-05-18 20:01:31,178 INFO org.apache.hadoop.dfs.DFSClient: Abandoning 
> block blk_2374125514769088471
> 2008-05-18 20:01:31,180 INFO org.apache.hadoop.dfs.DFSClient: Waiting to find 
> target node: 10.252.219.207:50010
> 2008-05-18 20:01:40,145 INFO org.apache.hadoop.dfs.DFSClient: Exception in 
> createBlockOutputStream java.net.SocketTimeoutException: Read timed out
> 2008-05-18 20:01:40,145 INFO org.apache.hadoop.dfs.DFSClient: Abandoning 
> block blk_-621042589816139684
> 2008-05-18 20:01:40,148 INFO org.apache.hadoop.dfs.DFSClient: Waiting to find 
> target node: 10.252.219.207:50010
> ..
> {code}
> Weirdly, the above is complaining that it can't connect to the datanode 
> running on the same host as the master.
> Eventually the split fails with:
> {code}
> 2008-05-18 20:24:28,393 WARN org.apache.hadoop.hbase.HMaster: Processing 
> pending operations: ProcessServerShutdown of 10.254.30.79:60020
> java.io.IOException: java.io.IOException: Could not complete write to file 
> /hbase/categories/1060231198/oldlogfile.log by DFSClient_520078809
>     at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:343)
>     at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:409)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:901)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>     at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>     at 
> org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:82)
>     at org.apache.hadoop.hbase.HMaster.run(HMaster.java:1116)
> 2008-05-18 20:24:28,394 DEBUG org.apache.hadoop.hbase.HMaster: Main 
> processing loop: ProcessServerShutdown of 10.254.30.79:60020
> 2008-05-18 20:24:28,394 INFO org.apache.hadoop.hbase.HMaster: process 
> shutdown of server 10.254.30.79:60020: logSplit: false, rootRescanned: false, 
> numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> 2008-05-18 20:24:28,395 INFO org.apache.hadoop.hbase.HLog: splitting 1 log(s) 
> in 
> hdfs://domU-12-31-38-00-D4-21:9000/hbase/log_10.254.30.79_1210899434766_60020
> 2008-05-18 20:24:28,395 DEBUG org.apache.hadoop.hbase.HLog: Splitting 0 of 1: 
> hdfs://domU-12-31-38-00-D4-21:9000/hbase/log_10.254.30.79_1210899434766_60020/hlog.dat.018
> 2008-05-18 20:24:28,399 WARN org.apache.hadoop.hbase.HLog: Old log file 
> hdfs://domU-12-31-38-00-D4-21:9000/hbase/categories/297165731/oldlogfile.log 
> already exists. Copying existing file to new file
> 2008-05-18 20:25:03,105 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: /10.254.30.79:60020. Already tried 9 time(s).
> 2008-05-18 20:25:28,401 WARN org.apache.hadoop.hbase.HLog: Exception 
> processing 
> hdfs://domU-12-31-38-00-D4-21:9000/hbase/log_10.254.30.79_1210899434766_60020/hlog.dat.018
>  -- continuing. Possible DATA LOSS!
> java.net.SocketTimeoutException: timed out waiting for rpc response
> ...
> {code}
> And we just move on to the next log -- we notice the earlier attempt at 
> distributing the edits and try to pick them up (though in this case, the file 
> is likely empty) -- but the split of the new log also fails.
> Though the log says 'Possible DATA LOSS!', we keep going.  The .META. and 
> other regions are reassigned and deployed though they are likely missing 
> edits.
> In this catastrophic case, I'd say the master should not move on, and if it 
> has to, it should go down rather than reassign regions and try to keep going.
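
For context, the failure mode in the quoted report is a catch-log-and-continue 
loop around the per-file log split.  A minimal sketch of that pattern (names 
like {{replayAndDistributeEdits}} are hypothetical, not the actual HLog code):

{code}
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.fs.Path;

class SplitLoopSketch {
  static void splitLogs(List<Path> logfiles) {
    for (Path logfile : logfiles) {
      try {
        replayAndDistributeEdits(logfile); // hypothetical helper
      } catch (IOException e) {
        // Swallowing the failure here is the bug the report complains about:
        // the loop moves on to the next log, and the regions are later
        // redeployed missing whatever edits were in this file.
        System.err.println("Exception processing " + logfile +
          " -- continuing. Possible DATA LOSS! " + e);
      }
    }
  }

  // Stand-in for reading HLog edits and writing the per-region
  // oldlogfile.log files (elided).
  static void replayAndDistributeEdits(Path logfile) throws IOException {
  }
}
{code}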

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
