[
https://issues.apache.org/jira/browse/HBASE-551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
stack updated HBASE-551:
------------------------
Attachment: 551.patch
M src/java/org/apache/hadoop/hbase/HLog.java
(splitLog): If an exception is thrown while processing a split, catch it.
In the finally clause, close and delete the split. Don't retry.
While in some circumstances we might recover, it's just as
likely that we'd hit the same exception again. If so, and there are
multiple files, we'd just keep accumulating edits until
kingdom come.
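For illustration, here is a minimal sketch of that control flow. The class, method signature, field names, and the exact cleanup are assumptions, not the attached patch; see 551.patch for the real change.
{code}
// Minimal sketch of the splitLog error handling described above, NOT the
// attached 551.patch: names and the cleanup details are illustrative.
import java.io.IOException;
import java.util.Map;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

class SplitLogSketch {
  private static final Log LOG = LogFactory.getLog(SplitLogSketch.class);

  static void splitLog(FileSystem fs, Configuration conf, Path srcLogDir,
      Path[] logfiles, Map<Text, SequenceFile.Writer> logWriters)
      throws IOException {
    try {
      for (Path logfile : logfiles) {
        SequenceFile.Reader in = new SequenceFile.Reader(fs, logfile, conf);
        // ... read each edit and append it to the per-region writer ...
        in.close();
      }
    } catch (IOException e) {
      // Catch instead of retrying: a retry might occasionally recover, but
      // more often the same exception recurs and each pass re-appends the
      // edits already copied.
      LOG.warn("Exception processing split of " + srcLogDir, e);
    } finally {
      // Close the per-region writers and delete the split so shutdown
      // processing does not loop over the same logs again.
      for (SequenceFile.Writer w : logWriters.values()) {
        try {
          w.close();
        } catch (IOException ioe) {
          LOG.warn("Close failed", ioe);
        }
      }
      fs.delete(srcLogDir, true);
    }
  }
}
{code}
The design choice is the one described above: give up on the split and clean up, so a bad log file cannot keep the master appending the same edits forever.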
> Master stuck splitting server logs in shutdown loop; on each iteration, edits
> are aggregated up into the millions
> -----------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-551
> URL: https://issues.apache.org/jira/browse/HBASE-551
> Project: Hadoop HBase
> Issue Type: Bug
> Reporter: stack
> Priority: Blocker
> Attachments: 551.patch
>
>
> Lars' cluster is sick with the master trying to split logs. The logs it's
> replaying have millions of edits in them.
> Here is a sample from the log. First we get the shutdown, and then in the
> shutdown process we start to split up the shut-down server's logs:
> {code}
> 2008-03-28 16:29:45,305 INFO org.apache.hadoop.hbase.HMaster: process shutdown of server 192.168.105.37:60020: logSplit: false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> 2008-03-28 16:29:45,310 INFO org.apache.hadoop.hbase.HLog: splitting 3 log(s) in hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/log_192.168.105.37_1206741382563_60020
> 2008-03-28 16:29:45,311 DEBUG org.apache.hadoop.hbase.HLog: Splitting 0 of 3: hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/log_192.168.105.37_1206741382563_60020/hlog.dat.002
> 2008-03-28 16:29:45,380 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer for path hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/pdc-docs/488338803/oldlogfile.log and region pdc-docs,EP01108687NWA2,1205739919655
> 2008-03-28 16:29:45,390 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer for path hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/pdc-docs/447465883/oldlogfile.log and region pdc-docs,EP01900680NWA1,1205754584444
> 2008-03-28 16:29:45,403 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer for path hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/pdc-docs/2035706226/oldlogfile.log and region pdc-docs,EP01119588NWA2,1205754281917
> 2008-03-28 16:29:45,428 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer for path hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/pdc-docs/437772136/oldlogfile.log and region pdc-docs,EP00200190NWA2,120576451593
> ...
> {code}
> We open a file in each region to take edits. We then start replaying the 3
> WAL files from the regionserver.
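> (For illustration, a minimal sketch of that replay is below. Class and method names, the per-region path map, and the use of the 2008-era HLogKey/HLogEdit classes are assumptions, not HLog's actual code.)
> {code}
> // Minimal sketch of the split replay described above, not HLog.splitLog
> // itself: one oldlogfile.log writer per region, created lazily while the
> // regionserver's WAL files are replayed.
> import java.io.IOException;
> import java.util.HashMap;
> import java.util.Map;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hbase.HLogEdit;
> import org.apache.hadoop.hbase.HLogKey;
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.hadoop.io.Text;
>
> class ReplaySketch {
>   static void replay(FileSystem fs, Configuration conf, Path[] walFiles,
>       Map<Text, Path> oldLogPathByRegion) throws IOException {
>     // oldLogPathByRegion is an illustrative stand-in: region name ->
>     // that region's .../oldlogfile.log path.
>     Map<Text, SequenceFile.Writer> writers =
>         new HashMap<Text, SequenceFile.Writer>();
>     for (Path wal : walFiles) {               // e.g. the 3 hlog.dat.* files
>       SequenceFile.Reader in = new SequenceFile.Reader(fs, wal, conf);
>       HLogKey key = new HLogKey();
>       HLogEdit val = new HLogEdit();
>       while (in.next(key, val)) {             // EOFException on a truncated WAL
>         Text region = new Text(key.getRegionName()); // copy: key is reused
>         SequenceFile.Writer out = writers.get(region);
>         if (out == null) {                    // first edit seen for this region
>           out = SequenceFile.createWriter(fs, conf,
>               oldLogPathByRegion.get(region), HLogKey.class, HLogEdit.class);
>           writers.put(region, out);
>         }
>         out.append(key, val);                 // edit lands in the region's oldlogfile.log
>       }
>       in.close();
>     }
>     for (SequenceFile.Writer w : writers.values()) {
>       w.close();
>     }
>   }
> }
> {code}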
> On the second one, we get an exception...
> {code}
> 2008-03-28 16:40:36,537 WARN org.apache.hadoop.hbase.HLog: Old log file hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/pdc-docs/104585846/oldlogfile.log already exists. Copying existing file to new file
> 2008-03-28 16:40:36,545 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer for path hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/pdc-docs/104585846/oldlogfile.log and region pdc-docs,EP96104830NWA1,1205768785572
> 2008-03-28 16:40:36,979 DEBUG org.apache.hadoop.hbase.HLog: Copied 220000 edits
> 2008-03-28 16:40:38,853 DEBUG org.apache.hadoop.hbase.HLog: Applied 222812 total edits
> 2008-03-28 16:40:38,853 DEBUG org.apache.hadoop.hbase.HLog: Splitting 1 of 3: hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/log_192.168.105.37_1206741382563_60020/hlog.dat.003
> 2008-03-28 16:40:56,883 WARN org.apache.hadoop.hbase.HLog: Old log file hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/pdc-docs/2118067194/oldlogfile.log already exists. Copying existing file to new file
> 2008-03-28 16:40:56,891 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer for path hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/pdc-docs/2118067194/oldlogfile.log and region pdc-docs,EP97302517NWA2,1205726201776
> 2008-03-28 16:41:12,910 DEBUG org.apache.hadoop.hbase.HLog: Applied 36638 total edits
> 2008-03-28 16:41:12,910 DEBUG org.apache.hadoop.hbase.HLog: Splitting 2 of 3: hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/log_192.168.105.37_1206741382563_60020/hlog.dat.004
> 2008-03-28 16:41:18,684 WARN org.apache.hadoop.hbase.HMaster: Processing pending operations: ProcessServerShutdown of 192.168.105.37:60020
> java.io.EOFException
>     at java.io.DataInputStream.readFully(DataInputStream.java:178)
>     at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:56)
>     at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:90)
>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1829)
>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1729)
>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1775)
>     at org.apache.hadoop.hbase.HLog.splitLog(HLog.java:540)
>     at org.apache.hadoop.hbase.HMaster$ProcessServerShutdown.process(HMaster.java:2167)
>     at org.apache.hadoop.hbase.HMaster.run(HMaster.java:1085)
> {code}
> A finally clause makes sure we close all the new files we've created in all
> regions. These new files have already accumulated some edits from the
> splitting of the first file.
> Because we got an exception, the shutdown processing runs again.
> Because the regions have files in place with edits, we don't overwrite them
> the second time through. Instead we copy the old file into a new one and
> start appending to it, until the exception happens again.
> After a couple of hours, we're up into the millions of edits.
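> (For illustration, a minimal sketch of that copy-then-append step is below. Names, paths, and the HLogKey/HLogEdit classes are assumptions, not HLog's actual code; it just shows why each pass starts from everything the last pass wrote.)
> {code}
> // Sketch of why edits accumulate: a retry finds last round's oldlogfile.log,
> // copies its edits into the new writer, and then replays the same WAL edits
> // on top, so every failed pass adds to what is already there.
> import java.io.IOException;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hbase.HLogEdit;
> import org.apache.hadoop.hbase.HLogKey;
> import org.apache.hadoop.io.SequenceFile;
>
> class RetryAccumulationSketch {
>   static SequenceFile.Writer openRegionWriter(FileSystem fs, Configuration conf,
>       Path existingOldLog, Path newOldLog) throws IOException {
>     SequenceFile.Writer out = SequenceFile.createWriter(fs, conf, newOldLog,
>         HLogKey.class, HLogEdit.class);
>     if (fs.exists(existingOldLog)) {
>       // "Old log file ... already exists. Copying existing file to new file"
>       SequenceFile.Reader old = new SequenceFile.Reader(fs, existingOldLog, conf);
>       HLogKey key = new HLogKey();
>       HLogEdit val = new HLogEdit();
>       while (old.next(key, val)) {
>         out.append(key, val);   // carries over every edit from the last attempt
>       }
>       old.close();
>     }
>     // The caller then replays the WAL into 'out' again; after each
>     // exception-and-retry cycle the file holds yet another copy of the
>     // same edits, which is how the counts climb into the millions.
>     return out;
>   }
> }
> {code}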