[
https://issues.apache.org/jira/browse/HBASE-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jean-Daniel Cryans updated HBASE-2375:
--------------------------------------
Attachment: HBASE-2375-flush-split.patch
This patch adds a check for split after flushing. I've been running it with
write-heavy tests and it lets you skip at least one compaction before
splitting.
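Roughly, the idea is to ask the region whether it should split right after a
successful flush instead of waiting for the next compaction. A minimal sketch
(the Region interface and helper names below are stand-ins for illustration,
not what the patch actually touches):
{code:java}
import java.io.IOException;

// Illustrative sketch only -- not the attached patch. The Region interface
// below is a stand-in for the real HRegion plumbing.
interface Region {
  void flushMemstore() throws IOException;  // write the current memstore to a new StoreFile
  boolean shouldSplit();                    // e.g. aggregate StoreFile size over the max region size
  void requestSplit();
  int storeFileCount();
  int compactionThreshold();
  void requestCompaction();
}

final class FlushThenSplitCheck {
  // New step: check the split condition right after a flush, so a region that
  // is already over the size limit splits before paying for another compaction.
  static void flushAndMaybeSplit(Region region) throws IOException {
    region.flushMemstore();
    if (region.shouldSplit()) {
      region.requestSplit();                // split now; skip the compaction entirely
    } else if (region.storeFileCount() >= region.compactionThreshold()) {
      region.requestCompaction();           // otherwise the usual compaction trigger applies
    }
  }
}
{code}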
> Make decision to split based on aggregate size of all StoreFiles and revisit
> related config params
> --------------------------------------------------------------------------------------------------
>
> Key: HBASE-2375
> URL: https://issues.apache.org/jira/browse/HBASE-2375
> Project: HBase
> Issue Type: Improvement
> Components: regionserver
> Affects Versions: 0.20.3
> Reporter: Jonathan Gray
> Assignee: Jonathan Gray
> Priority: Critical
> Labels: moved_from_0_20_5
> Attachments: HBASE-2375-flush-split.patch, HBASE-2375-v8.patch
>
>
> Currently we will make the decision to split a region when a single StoreFile
> in a single family exceeds the maximum region size. This issue is about
> changing the decision to split to be based on the aggregate size of all
> StoreFiles in a single family (but still not aggregating across families).
> This would move a check to split after flushes rather than after compactions.
> This issue should also deal with revisiting our default values for some
> related configuration parameters.
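> A minimal sketch of the proposed check, assuming hypothetical helper names
> (the real change would live in the region/store code):
> {code:java}
> import java.util.List;
> import java.util.Map;
>
> // Sketch of the proposed decision (hypothetical helper, not the real code):
> // split when the aggregate size of all StoreFiles in any single family
> // exceeds hbase.hregion.max.filesize. Sizes are never summed across families.
> class AggregateSplitCheck {
>   static boolean shouldSplit(Map<String, List<Long>> storeFileSizesByFamily,
>                              long maxRegionSize) {
>     for (List<Long> sizes : storeFileSizesByFamily.values()) {
>       long aggregate = 0;
>       for (long size : sizes) {
>         aggregate += size;                 // sum within one family only
>       }
>       if (aggregate > maxRegionSize) {     // any one family over the limit => split
>         return true;
>       }
>     }
>     return false;   // today we would instead compare only the largest single StoreFile
>   }
> }
> {code}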
> The motivating factor for this change comes from watching the behavior of
> RegionServers during heavy write scenarios.
> Today the default behavior goes like this:
> - We fill up regions, and as long as you are not under global RS heap
> pressure, you will write out 64MB (hbase.hregion.memstore.flush.size)
> StoreFiles.
> - After we get 3 StoreFiles (hbase.hstore.compactionThreshold) we trigger a
> compaction on this region.
> - Compaction queues notwithstanding, this will create a 192MB file, not
> triggering a split based on max region size (hbase.hregion.max.filesize); see
> the sketch after this list.
> - You'll then flush two more 64MB MemStores and hit the compactionThreshold
> and trigger a compaction.
> - You end up with 192 + 64 + 64 in a single compaction. This will create a
> single 320MB file and will trigger a split.
> - While you are performing the compaction (which now writes out 64MB more
> than the split size, so takes about 5X as long as a single flush), you are
> still taking on additional writes into MemStore.
> - Compaction finishes, decision to split is made, region is closed. The
> region now has to flush whichever edits made it to MemStore while the
> compaction ran. This flushing, in our tests, is by far the dominating factor
> in how long data is unavailable during a split. We measured about 1 second
> to do the region closing, master assignment, reopening. Flushing could take
> 5-6 seconds, during which time the region is unavailable.
> - The daughter regions re-open on the same RS. As soon as the StoreFiles are
> opened, a compaction is triggered across all of them because they contain
> references. Since we cannot currently split a split, we must not hang on to
> these references for long.
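> The max-region-size check referenced in the list above effectively looks only
> at the largest single StoreFile, roughly like this sketch (hypothetical
> helper, not the actual code):
> {code:java}
> // Sketch of today's effective check: only the largest single StoreFile in a
> // family is compared to hbase.hregion.max.filesize, so three 64MB flushes
> // compacted into one 192MB file do not trigger a split.
> class CurrentSplitCheck {
>   static boolean shouldSplitToday(long[] storeFileSizes, long maxRegionSize) {
>     long largest = 0;
>     for (long size : storeFileSizes) {
>       largest = Math.max(largest, size);
>     }
>     return largest > maxRegionSize;        // 192MB < 256MB, so no split yet
>   }
> }
> {code}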
> The behavior described above is really bad because of how often we have to
> rewrite data onto HDFS. Imports are usually just IO bound as the RS waits to flush
> and compact. In the above example, the first cell to be inserted into this
> region ends up being written to HDFS 4 times (initial flush, first compaction
> w/ no split decision, second compaction w/ split decision, third compaction
> on daughter region). In addition, we leave a large window where we take on
> edits (during the second compaction of 320MB) and then must make the region
> unavailable as we flush it.
> If we increase the compactionThreshold to 5 and determine splits based on
> aggregate size, the behavior becomes:
> - We fill up regions, and as long as you are not under global RS heap
> pressure, you will write out 64MB (hbase.hregion.memstore.flush.size)
> StoreFiles.
> - After each MemStore flush, we calculate the aggregate size of all
> StoreFiles. We can also check the compactionThreshold. For the first three
> flushes, neither check hits its limit. On the fourth flush, we see a total
> aggregate size of 256MB and decide to split.
> - Decision to split is made, region is closed. This time, the region just
> has to flush out whichever edits made it to the MemStore during the
> snapshot/flush of the previous MemStore. This time window has shrunk by more
> than 75%, since it is now the time to write 64MB from memory rather than
> 320MB from aggregating 5 HDFS files. This will greatly reduce the time data
> is unavailable during splits.
> - The daughter regions re-open on the same RS. As soon as the StoreFiles are
> opened, a compaction is triggered across all of them because they contain
> references. This would stay the same.
> In this example, we only write a given cell twice (instead of 4 times): once
> on the original flush, and once post-split to remove references. At the same
> time we drastically reduce data unavailability during splits. The other
> benefit of post-split compaction (which doesn't change) is that we then get
> good data locality, as the resulting StoreFile will be written to the local
> DataNode. In another JIRA, we should deal with opening up one of the daughter
> regions on a different RS to distribute load better, but that's outside the
> scope of this one.
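> To make the arithmetic above concrete, here is a back-of-the-envelope sketch
> of both timelines under the assumed defaults (64MB flushes, 256MB max region
> size); it is illustrative only, not real HBase code:
> {code:java}
> public class SplitTimelineDemo {
>   public static void main(String[] args) {
>     long flushMb = 64, maxRegionMb = 256;
>
>     // Today: compact at 3 StoreFiles, split only once a compacted file exceeds the max.
>     long firstCompactionMb = 3 * flushMb;                      // 192MB, under the 256MB limit
>     long secondCompactionMb = firstCompactionMb + 2 * flushMb; // 320MB, over the limit => split
>     System.out.println("today: split after compacting " + secondCompactionMb
>         + "MB; first cell written 4 times");
>
>     // Proposed: split on aggregate StoreFile size (compactionThreshold raised to 5).
>     long flushesToSplit = maxRegionMb / flushMb;               // 4 flushes reach 256MB
>     System.out.println("proposed: split after " + flushesToSplit + " flushes ("
>         + flushesToSplit * flushMb + "MB aggregate); first cell written 2 times");
>   }
> }
> {code}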
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira