Jean-Adrien wrote:
Hello,
I have a question regarding the behavior of HBase at startup time.
First the region servers load all regions of enabled tables, then a batch
of (minor?) compactions is run on some of these regions:
2008-12-17 11:04:46,688 INFO org.apache.hadoop.hbase.regionserver.HRegion:
starting compaction on region
test-D-0.3,GST13927+129099482919-13927,1229196632010
2008-12-17 11:05:36,196 INFO org.apache.hadoop.hbase.regionserver.HRegion:
compaction completed on region
test-D-0.3,GST13927+129099482919-13927,1229196632010 in 49sec
Which regions are affected? All of them? Only the regions that have
been modified since the last log roll?
Every region schedules a compaction when it is opened (usually the
compaction is 'minor', unless the 'major' interval has elapsed).
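As a rough illustration of that rule, here is a minimal sketch. It is not
the actual HBase code: the function name, the timestamp arguments, and the
interval parameter are all made up for the example.

```python
def pick_compaction_type(last_major_ts, now, major_interval_secs):
    """Return 'major' if the major interval has elapsed, else 'minor'.

    Hypothetical helper: timestamps are plain seconds, and the interval
    is whatever the cluster is configured with (assumed, not from HBase).
    """
    if now - last_major_ts >= major_interval_secs:
        return "major"
    return "minor"


# A region whose last major compaction is older than the interval gets a
# major compaction on open; otherwise it gets a minor one.
one_day = 24 * 60 * 60
print(pick_compaction_type(last_major_ts=0, now=2 * one_day,
                           major_interval_secs=one_day))
print(pick_compaction_type(last_major_ts=2 * one_day - 60, now=2 * one_day,
                           major_interval_secs=one_day))
```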
We added this a while back for the following reason. Region opens
are usually the result of a split. Splits are done by creating facades
on the parent region's mapfiles. These facades (or 'References' in
hbase-speak) reference the parent region's mapfiles; one facade
serves up the top half of the parent's mapfiles while the other serves
the bottom half. This mechanism makes splits run fast. The downside
is that while these References are present in a region, the region is
not splittable; this avoids a build-up of compound, fragile
References-to-References relationships. Compactions clean up
References by writing the content of the parent's top or bottom half into
new mapfiles in the daughter regions. During heavy-duty uploading,
splits are fast and furious. So that regions become splittable again as
soon as possible, we clean up References as fast as possible by
scheduling a compaction immediately on open.
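To make the mechanism concrete, here is a toy model of it. This is only a
sketch under stated assumptions, not HBase's implementation: the class
names `MapFile`, `Reference`, and `Region`, the in-memory dict standing in
for a mapfile, and the choice that the 'top' half holds keys at or above
the split key are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class MapFile:
    rows: dict  # key -> value; stands in for an on-disk mapfile


@dataclass
class Reference:
    """Facade over one half of a parent mapfile (no data is copied)."""
    parent: MapFile
    split_key: str
    top: bool  # assumed: True serves keys >= split_key, False keys < it

    def items(self):
        for key in sorted(self.parent.rows):
            if (key >= self.split_key) == self.top:
                yield key, self.parent.rows[key]


@dataclass
class Region:
    name: str
    files: list = field(default_factory=list)

    def is_splittable(self):
        # A region holding References must not split again.
        return not any(isinstance(f, Reference) for f in self.files)

    def compact(self):
        # Rewrite everything this region can see into one real mapfile,
        # which drops the References as a side effect.
        merged = {}
        for f in self.files:
            items = f.items() if isinstance(f, Reference) else f.rows.items()
            merged.update(items)
        self.files = [MapFile(merged)]


def split(parent, split_key):
    """Fast split: the daughters only get References to the parent's files."""
    if not parent.is_splittable():
        raise ValueError("region still holds References; compact first")
    bottom = Region(parent.name + "-bottom",
                    [Reference(f, split_key, False) for f in parent.files])
    top = Region(parent.name + "-top",
                 [Reference(f, split_key, True) for f in parent.files])
    return bottom, top


parent = Region("r", [MapFile({"a": 1, "b": 2, "c": 3, "d": 4})])
bottom, top = split(parent, "c")
assert not top.is_splittable()   # daughters hold References right after split
top.compact()
assert top.is_splittable()       # compaction replaced them with a real file
```

In this model the split itself copies nothing, which is why it is cheap, and
the cost is deferred to the compaction that each daughter runs afterwards.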
What is missing from the above is special handling of startup. Andrew has
started work on this in HBASE-1062.
In my case it takes several hours to complete, since I have about 500
regions for 2 region servers. And if I have understood correctly how Hadoop
works, this means that the entire HDFS content is rewritten during this
phase, since files are write-once. Is that right?
Sounds like the original report in HBASE-938 (though that issue got
hijacked to address a different problem). Do you think a major compaction
is being triggered on each startup?
Was this a clean shutdown, Jean-Adrien?
As for rewriting all data, that shouldn't happen. Before the HBASE-938 fix,
we'd rewrite all data whenever a major compaction ran, but not since the
fix was committed.
TRUNK has improvements in this area including logging what type of
compaction is running, whether major or minor.
If I disable and re-enable a table, do the compactions have to run again?
Since regions are opened on re-enable, a compaction check will be
scheduled, but if there is nothing to do, the compaction will be a no-op.
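Roughly, the check on open behaves like the sketch below. This is an
illustration only: what counts as 'nothing to do' is my simplification
(no References left and fewer store files than some threshold), and the
function name and the default threshold of 3 are assumptions, not values
read from HBase.

```python
def compaction_check(file_count, has_references, threshold=3):
    """Decide what a scheduled compaction check does for one region.

    Hypothetical rule: compact if References still need cleaning up or
    if enough store files have piled up; otherwise do nothing.
    """
    if has_references or file_count >= threshold:
        return "compact"
    return "noop"


# A freshly split daughter still has References, so it compacts.
print(compaction_check(file_count=1, has_references=True))
# A region that was already compacted before the disable has nothing to do.
print(compaction_check(file_count=1, has_references=False))
```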
St.Ack