[ 
https://issues.apache.org/jira/browse/HBASE-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902279#action_12902279
 ] 

Todd Lipcon commented on HBASE-2933:
------------------------------------

I can't remember the particular JIRA either, but it seems to me that the 
regionserver shouldn't even get to the point of doing recovery if the logs 
haven't been completely split. I.e., the phases should be:

1) Original RS is writing logs and dies
2) Master A notices failure and starts splitting logs. It gets halfway through 
writing region_1/oldlog
3) Master A dies
4) Master B takes over, and knows from ZK that RS's recovery is incomplete.
5) Master B should remove the half-written log split done by Master A, and try 
again from the start.
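
The phases above can be sketched roughly as follows. This is only an illustration, not HBase's actual log-splitting code: the class `SplitRestartSketch`, the methods `removePartialSplit` and `restartSplit`, and the use of a local directory in place of HDFS (and of ZK state) are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical sketch: a new master discards a partial split and redoes it. */
public class SplitRestartSketch {

    /** Deletes the split directory and everything in it, if it exists. */
    static void removePartialSplit(Path splitDir) throws IOException {
        if (!Files.exists(splitDir)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(splitDir)) {
            // Reverse order visits children before their parent directories.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    /**
     * Step 5: throw away whatever the previous master left behind,
     * then write the full set of edits again from the start.
     */
    static Path restartSplit(Path splitDir, List<String> edits) throws IOException {
        removePartialSplit(splitDir);
        Files.createDirectories(splitDir);
        // "oldlog" mirrors the file name used in the phases above.
        return Files.write(splitDir.resolve("oldlog"), edits);
    }
}
```

The point of the sketch is only the ordering: the stale output is removed before any new output is written, so a reader never sees a mix of the two attempts.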

I.e., no region server should attempt to open region 1 until the logs have been 
properly split. Thus, the RS should never see an EOFException on log recovery, 
since an EOFException there indicates that log splitting is incomplete.
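
To make the region server side concrete, here is a minimal sketch in which a truncated edits file surfaces as an EOFException and the region open is refused rather than skipped. It does not use HBase's real reader: the class `ReplaySketch`, the methods `replay` and `tryOpenRegion`, and the length-prefixed record format are assumptions for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: treat a truncated edits file as an incomplete split. */
public class ReplaySketch {

    /** Reads records of the assumed form [int length][payload bytes]. */
    static List<byte[]> replay(byte[] file) throws IOException {
        List<byte[]> edits = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(file))) {
            while (true) {
                in.mark(1);
                if (in.read() == -1) {
                    return edits; // clean end of file at a record boundary
                }
                in.reset();
                int len = in.readInt();   // EOFException if the length prefix is cut off
                byte[] edit = new byte[len];
                in.readFully(edit);       // EOFException if the payload is cut off
                edits.add(edit);
            }
        }
    }

    /** Refuses to open the region when the edits file is truncated. */
    static boolean tryOpenRegion(byte[] recoveredEdits) {
        try {
            replay(recoveredEdits);
            return true;
        } catch (EOFException e) {
            // Under the ordering described above, this means splitting never finished.
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```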

> Always Skip Errors during Log Recovery
> --------------------------------------
>
>                 Key: HBASE-2933
>                 URL: https://issues.apache.org/jira/browse/HBASE-2933
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Nicolas Spiegelberg
>            Assignee: Nicolas Spiegelberg
>
> While testing a cluster, we hit upon the following assert during region 
> assignment.  We were killing the master during a long run of splits.  We think 
> what happened is that the HMaster was killed while splitting, woke up, and split 
> again.  If this happens, we will have 2 files: 1 partially written and 1 
> complete one.  Since encountering partial log splits upon Master failure is 
> considered normal behavior, we should continue at the RS level if we 
> encounter an EOFException and not a filesystem-level exception, even with 
> skip.errors == false.
>
> 2010-08-20 16:59:07,718 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening MailBox_dsanduleac,57db45276ece7ce03ef7e8d9969eb189:[email protected],1280960828959.7c542d24d4496e273b739231b01885e6.
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:375)
>         at org.apache.hadoop.io.SequenceFile$Reader.readRecordLength(SequenceFile.java:1902)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1932)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1883)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:121)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:113)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEdits(HRegion.java:1981)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEdits(HRegion.java:1956)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsIfAny(HRegion.java:1915)
>         at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:344)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateRegion(HRegionServer.java:1490)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1437)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1345)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-08-20 16:59:07,719 ERROR org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Aborting open of region 7c542d24d4496e273b739231b01885e6

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
