Master stuck splitting server logs in shutdown loop; on each iteration, edits are aggregated up into the millions
-----------------------------------------------------------------------------------------------------------------
Key: HBASE-551
URL: https://issues.apache.org/jira/browse/HBASE-551
Project: Hadoop HBase
Issue Type: Bug
Reporter: stack
Priority: Blocker
Lars' cluster is sick: the master is stuck trying to split logs, and the logs it is replaying have millions of edits in them.
Here is a sample from the log. First we get the shutdown, and then, as part of shutdown processing, we start to split the shutdown server's logs:
{code}
2008-03-28 16:29:45,305 INFO org.apache.hadoop.hbase.HMaster: process shutdown of server 192.168.105.37:60020: logSplit: false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
2008-03-28 16:29:45,310 INFO org.apache.hadoop.hbase.HLog: splitting 3 log(s) in hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/log_192.168.105.37_1206741382563_60020
2008-03-28 16:29:45,311 DEBUG org.apache.hadoop.hbase.HLog: Splitting 0 of 3: hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/log_192.168.105.37_1206741382563_60020/hlog.dat.002
2008-03-28 16:29:45,380 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer for path hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/pdc-docs/488338803/oldlogfile.log and region pdc-docs,EP01108687NWA2,1205739919655
2008-03-28 16:29:45,390 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer for path hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/pdc-docs/447465883/oldlogfile.log and region pdc-docs,EP01900680NWA1,1205754584444
2008-03-28 16:29:45,403 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer for path hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/pdc-docs/2035706226/oldlogfile.log and region pdc-docs,EP01119588NWA2,1205754281917
2008-03-28 16:29:45,428 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer for path hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/pdc-docs/437772136/oldlogfile.log and region pdc-docs,EP00200190NWA2,120576451593
...
{code}
We open a file in each region to take edits, then start replaying the 3 WAL files from the regionserver.
On the second one, we get an exception...
{code}
2008-03-28 16:40:36,537 WARN org.apache.hadoop.hbase.HLog: Old log file hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/pdc-docs/104585846/oldlogfile.log already exists. Copying existing file to new file
2008-03-28 16:40:36,545 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer for path hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/pdc-docs/104585846/oldlogfile.log and region pdc-docs,EP96104830NWA1,1205768785572
2008-03-28 16:40:36,979 DEBUG org.apache.hadoop.hbase.HLog: Copied 220000 edits
2008-03-28 16:40:38,853 DEBUG org.apache.hadoop.hbase.HLog: Applied 222812 total edits
2008-03-28 16:40:38,853 DEBUG org.apache.hadoop.hbase.HLog: Splitting 1 of 3: hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/log_192.168.105.37_1206741382563_60020/hlog.dat.003
2008-03-28 16:40:56,883 WARN org.apache.hadoop.hbase.HLog: Old log file hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/pdc-docs/2118067194/oldlogfile.log already exists. Copying existing file to new file
2008-03-28 16:40:56,891 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer for path hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/pdc-docs/2118067194/oldlogfile.log and region pdc-docs,EP97302517NWA2,1205726201776
2008-03-28 16:41:12,910 DEBUG org.apache.hadoop.hbase.HLog: Applied 36638 total edits
2008-03-28 16:41:12,910 DEBUG org.apache.hadoop.hbase.HLog: Splitting 2 of 3: hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/log_192.168.105.37_1206741382563_60020/hlog.dat.004
2008-03-28 16:41:18,684 WARN org.apache.hadoop.hbase.HMaster: Processing pending operations: ProcessServerShutdown of 192.168.105.37:60020
java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:178)
    at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:56)
    at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:90)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1829)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1729)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1775)
    at org.apache.hadoop.hbase.HLog.splitLog(HLog.java:540)
    at org.apache.hadoop.hbase.HMaster$ProcessServerShutdown.process(HMaster.java:2167)
    at org.apache.hadoop.hbase.HMaster.run(HMaster.java:1085)
{code}
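The split flow that hits this exception can be sketched roughly as follows. This is a simplified, in-memory illustration, not the real HLog.splitLog code: class name, edit format, and the RuntimeException stand-in for the EOFException are all ours. It shows why the per-region files end up holding edits even though the split as a whole failed.
{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch of the split flow: replay WAL files in order, routing
// each edit to a per-region writer. When a later file is corrupt, the
// writers opened so far still hold the edits from the earlier, intact files.
public class SplitSketch {

    /** The maps of edits stand in for the per-region oldlogfile.log writers on HDFS. */
    static Map<String, List<String>> split(List<List<String>> walFiles, int corruptIndex) {
        Map<String, List<String>> regionWriters = new HashMap<>();
        try {
            for (int i = 0; i < walFiles.size(); i++) {
                if (i == corruptIndex) {
                    // Models the EOFException thrown by SequenceFile$Reader.next.
                    throw new RuntimeException("EOF reading wal file " + i);
                }
                for (String edit : walFiles.get(i)) {
                    // Illustrative edit format: "region:payload".
                    String region = edit.split(":")[0];
                    regionWriters.computeIfAbsent(region, r -> new ArrayList<>()).add(edit);
                }
            }
        } catch (RuntimeException e) {
            // The master logs the failure and requeues ProcessServerShutdown.
        } finally {
            // Here the real code closes all per-region writers, flushing the
            // edits already taken from the first, intact log file.
        }
        return regionWriters;
    }

    public static void main(String[] args) {
        List<List<String>> wals = List.of(
                List.of("regionA:put1", "regionB:put2"),  // first file (intact)
                List.of("regionA:put3"));                 // second file (corrupt in this run)
        // Both regions keep their first-file edits despite the failure.
        System.out.println(split(wals, 1));
    }
}
{code}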
A finally clause makes sure we close all the new files we have created across the regions. These new files have accumulated some edits from the splitting of the first file.
Because we got an exception, the shutdown processing runs again.
Because the regions already have files in place with edits, we don't overwrite them the second time through. Instead, we copy the old edits into a new file and keep appending to it until the exception happens again.
After a couple of hours of this loop, we're up into the millions of edits.
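Since each pass copies the previous pass's edits forward and then re-appends the replay of the first file, the per-region total grows by roughly the same amount on every retry. A tiny arithmetic sketch of that growth, using the 222812-edit figure from the log above (class and method names are ours, not HBase's):
{code}
// Models the retry accumulation: each ProcessServerShutdown attempt finds
// oldlogfile.log already present, copies its contents to a fresh file, then
// appends the first WAL file's edits again before failing on the corrupt file.
public class SplitRetryGrowth {

    /** Edits in oldlogfile.log after the given number of failed attempts. */
    static long editsAfter(int retries, long editsPerPass) {
        long accumulated = 0;
        for (int r = 0; r < retries; r++) {
            // Copy forward the `accumulated` existing edits, then append
            // one more replay of the first file's edits.
            accumulated += editsPerPass;
        }
        return accumulated;
    }

    public static void main(String[] args) {
        long perPass = 222_812L;  // "Applied 222812 total edits" in the log above
        for (int r = 1; r <= 5; r++) {
            System.out.printf("after attempt %d: %,d edits%n", r, editsAfter(r, perPass));
        }
        // With each attempt taking roughly ten minutes (per the log timestamps),
        // a couple of hours of retries puts the count in the millions.
    }
}
{code}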