[
https://issues.apache.org/jira/browse/HBASE-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907478#action_12907478
]
stack commented on HBASE-2967:
------------------------------
So, looking at a snapshot of log files on our production cluster, about half had this
issue. After rolling out the above change, subsequent log files are again
parseable.
> Failed split: IOE 'File is Corrupt!' -- sync length not being written out to
> SequenceFile
> -----------------------------------------------------------------------------------------
>
> Key: HBASE-2967
> URL: https://issues.apache.org/jira/browse/HBASE-2967
> Project: HBase
> Issue Type: Bug
> Reporter: stack
> Priority: Blocker
> Fix For: 0.90.0
>
>
> We saw this on one of our clusters:
> {code}
> 2010-09-07 18:07:16,229 WARN
> org.apache.hadoop.hbase.master.RegionServerOperationQueue: Failed processing:
> ProcessServerShutdown of sv4borg18,60020,1283516293515; putting onto delayed
> todo queue
> java.io.IOException: File is corrupt!
> at
> org.apache.hadoop.io.SequenceFile$Reader.readRecordLength(SequenceFile.java:1907)
> at
> org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1932)
> at
> org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
> at
> org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1883)
> at
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:121)
> at
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:113)
> at
> org.apache.hadoop.hbase.regionserver.wal.HLog.parseHLog(HLog.java:1493)
> at
> org.apache.hadoop.hbase.regionserver.wal.HLog.splitLog(HLog.java:1256)
> at
> org.apache.hadoop.hbase.regionserver.wal.HLog.splitLog(HLog.java:1143)
> at
> org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:299)
> at
> org.apache.hadoop.hbase.master.RegionServerOperationQueue.process(RegionServerOperationQueue.java:147)
> at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:532)
> {code}
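> For context, here is a rough, self-contained approximation of the check that throws that 'File is corrupt!' IOE (the names below are illustrative, not the actual Hadoop internals): SequenceFile interleaves a 16-byte sync marker into the stream, and when the reader hits a record length equal to the sync escape value it reads the next 16 bytes and compares them against the file's own sync bytes; a mismatch is reported as corruption.
> {code}
> import java.io.DataInput;
> import java.io.IOException;
> import java.util.Arrays;
>
> // Illustrative sketch only; approximates the sync check behind the
> // "File is corrupt!" IOE, not the actual SequenceFile source.
> class SyncCheckSketch {
>   static final int SYNC_ESCAPE = -1;     // record length that flags a sync block
>   static final int SYNC_HASH_SIZE = 16;  // 16 random bytes chosen per file
>
>   // Reads the next record length, verifying any sync marker it crosses.
>   static int readRecordLength(DataInput in, byte[] fileSync) throws IOException {
>     int length = in.readInt();
>     if (length == SYNC_ESCAPE) {                  // a sync block follows
>       byte[] syncCheck = new byte[SYNC_HASH_SIZE];
>       in.readFully(syncCheck);
>       if (!Arrays.equals(fileSync, syncCheck)) {  // on-disk bytes don't match
>         throw new IOException("File is corrupt!");
>       }
>       length = in.readInt();                      // re-read the real record length
>     }
>     return length;
>   }
> }
> {code}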
> Because it was an IOE, the shutdown processing got requeued onto the delayed todo
> queue, and each time around we failed on it again.
> A few things:
> + This exception needs to include the filename and the position in the file at which
> the problem was found.
> + Need to commit the little patch over in HBASE-2889 that outputs the position and
> ordinal of each WAL edit, because it helps diagnose these kinds of issues.
> + We should be able to skip the bad edit: just position ourselves at the byte past
> the bad sync and start reading again (sketched below).
> + There must be something about our setup that makes us fail to write the 16 random
> bytes that make up the SequenceFile 'sync' marker, though oddly, for one of the
> files the sync failure happens about a third of the way into a 64MB WAL, at edit
> #2000 out of some 130k edits.
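> On the third point, a minimal sketch of what skipping might look like (this is not the HBase implementation; it assumes a plain SequenceFile.Reader opened on the WAL and uses its public sync(long) call to seek to the next sync mark past the failure point):
> {code}
> import java.io.IOException;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.hadoop.io.Writable;
>
> public class SkipBadSyncSketch {
>   // Reads every parseable (key, value) pair, reporting and skipping past
>   // corrupt syncs instead of aborting the whole split.
>   public static void readSkippingCorruptSyncs(Path wal, Configuration conf,
>       Writable key, Writable val) throws IOException {
>     FileSystem fs = wal.getFileSystem(conf);
>     SequenceFile.Reader reader = new SequenceFile.Reader(fs, wal, conf);
>     try {
>       while (true) {
>         long pos = reader.getPosition();
>         try {
>           if (!reader.next(key, val)) {
>             break;                          // clean end of file
>           }
>           // ... hand (key, val) to the split/recovery code here ...
>         } catch (IOException ioe) {
>           // "File is corrupt!" or similar: report file and offset (first
>           // point above), then position past the bad sync and keep reading
>           // (third point above).
>           System.err.println("Bad edit in " + wal + " near offset " + pos
>               + ": " + ioe.getMessage());
>           reader.sync(pos + 1);             // seek to the next sync mark
>           if (reader.getPosition() <= pos) {
>             break;                          // no further sync mark; give up
>           }
>         }
>       }
>     } finally {
>       reader.close();
>     }
>   }
> }
> {code}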
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.