[
https://issues.apache.org/jira/browse/HBASE-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
stack updated HBASE-2933:
-------------------------
Fix Version/s: 0.90.0
Priority: Critical (was: Major)
Bringing into 0.90. Need to write a test to ensure that the new master removes
old, partially split logs if the old master died mid-split; also make it so we
don't die if the RS gets an EOF -- though, as Todd says, this should never
happen if the split was done properly -- and we should also keep going if we
get something like an IOE "File is corrupt!" (see below).
{code}
  private synchronized int readRecordLength() throws IOException {
    if (in.getPos() >= end) {
      return -1;
    }
    int length = in.readInt();
    if (version > 1 && sync != null &&
        length == SYNC_ESCAPE) {             // process a sync entry
      in.readFully(syncCheck);               // read syncCheck
      if (!Arrays.equals(sync, syncCheck))   // check it
        throw new IOException("File is corrupt!");
      syncSeen = true;
      if (in.getPos() >= end) {
        return -1;
      }
      length = in.readInt();                 // re-read length
    } else {
      syncSeen = false;
    }
    return length;
  }
{code}
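On the other point above -- the new master removing old, partially split logs --
here is a minimal sketch of that cleanup using the plain Hadoop FileSystem API.
The SplitOutputCleaner name and the one-directory-of-split-output layout are
assumptions for illustration, not the actual master split code:
{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: before re-splitting, remove whatever a master that died
// mid-split left behind, so a region never sees both a partial and a complete
// copy of the same edits.
public class SplitOutputCleaner {

  /**
   * Deletes any files already present under the split output directory.
   * The layout (one directory of split output per attempt) is an assumption
   * for illustration only.
   */
  public static void cleanOldSplitOutput(FileSystem fs, Path splitOutputDir)
      throws IOException {
    if (!fs.exists(splitOutputDir)) {
      return;
    }
    for (FileStatus stat : fs.listStatus(splitOutputDir)) {
      // Partially written files from the crashed attempt are removed; the new
      // split regenerates complete copies of the same edits.
      fs.delete(stat.getPath(), true);
    }
  }
}
{code}
The test asked for above would then amount to: kill the master mid-split,
restart it, and assert that only the complete split output remains.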
> Skip EOF Errors during Log Recovery
> -----------------------------------
>
> Key: HBASE-2933
> URL: https://issues.apache.org/jira/browse/HBASE-2933
> Project: HBase
> Issue Type: Bug
> Reporter: Nicolas Spiegelberg
> Assignee: Nicolas Spiegelberg
> Priority: Critical
> Fix For: 0.90.0
>
>
> While testing a cluster, we hit upon the following assert during region
> assignment. We were killing the master during a long run of splits. We think
> what happened is that the HMaster was killed while splitting, woke up & split
> again. If this happens, we will have two files: one partially written and one
> complete. Since encountering partial log splits upon Master failure is
> considered normal behavior, we should continue at the RS level if we
> encounter an EOFException & not a filesystem-level exception, even with
> skip.errors == false.
> 2010-08-20 16:59:07,718 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening MailBox_dsanduleac,57db45276ece7ce03ef7e8d9969eb189:[email protected],1280960828959.7c542d24d4496e273b739231b01885e6.
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:375)
>         at org.apache.hadoop.io.SequenceFile$Reader.readRecordLength(SequenceFile.java:1902)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1932)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1883)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:121)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:113)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEdits(HRegion.java:1981)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEdits(HRegion.java:1956)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsIfAny(HRegion.java:1915)
>         at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:344)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateRegion(HRegionServer.java:1490)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1437)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1345)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-08-20 16:59:07,719 ERROR org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Aborting open of region 7c542d24d4496e273b739231b01885e6
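The failure above happens inside replayRecoveredEdits, so the fix the
description asks for amounts to treating a trailing EOFException as the
expected end of a partially split log. A minimal sketch of that behavior,
assuming a hypothetical EditSource interface rather than the real HLog reader
API:
{code}
import java.io.EOFException;
import java.io.IOException;

// Hypothetical sketch, not HRegion.replayRecoveredEdits itself: replay edits
// from a recovered log and treat a trailing EOFException as the end of a
// partially split file rather than a reason to abort the region open.
public class RecoveredEditsReplay {

  /** Minimal stand-in for an edit source that may end in a truncated record. */
  public interface EditSource {
    /** Returns the next edit, or null at a clean end of file. */
    Object next() throws IOException;
  }

  /** Replays edits; returns the number applied before the log ended. */
  public static int replay(EditSource edits, boolean skipErrors) throws IOException {
    int applied = 0;
    while (true) {
      Object edit;
      try {
        edit = edits.next();
      } catch (EOFException e) {
        // Partial log left by a master that died mid-split: stop here, keep
        // what we already applied, and let the region open proceed.
        break;
      } catch (IOException e) {
        // Any other filesystem-level problem is only ignorable if skip.errors
        // was explicitly enabled.
        if (skipErrors) {
          break;
        }
        throw e;
      }
      if (edit == null) {
        break;
      }
      applied++;   // in the real code this is where the edit would be applied
    }
    return applied;
  }
}
{code}
Only EOFException gets unconditional tolerance here; any other IOException
still aborts the open unless skip.errors is set, matching the distinction the
description draws between a truncated record and genuine filesystem errors.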