[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575883#comment-13575883 ]

Nicolas Spiegelberg commented on HBASE-7667:
--------------------------------------------

Some thoughts I had about this:

Overall, I think it's a good idea.  It seems like it's not crazy to add and would 
have multiple benefits.  Logical striping across the L1 boundary is a simple way 
to both proactively handle splits and reduce compaction times.
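
To make the "logical striping" idea a bit more concrete: reads would consult 
exactly one stripe (plus L0), found by a binary search over fixed boundaries.  
A minimal sketch, with entirely hypothetical names (not from any patch):

    import java.util.Arrays;

    // Hypothetical illustration: a region's key space carved into N fixed-boundary
    // stripes. Reads consult exactly one stripe (plus any L0 files, not shown).
    class StripeIndex {
      private final byte[][] boundaries; // N-1 sorted boundary rows for N stripes

      StripeIndex(byte[][] boundaries) {
        this.boundaries = boundaries;
      }

      /** Returns the index of the stripe whose [start, end) range contains row. */
      int stripeFor(byte[] row) {
        int lo = 0, hi = boundaries.length;
        while (lo < hi) {
          int mid = (lo + hi) >>> 1;
          if (Arrays.compareUnsigned(boundaries[mid], row) <= 0) {
            lo = mid + 1;
          } else {
            hi = mid;
          }
        }
        return lo; // stripe i covers [boundaries[i-1], boundaries[i])
      }
    }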

Thoughts on this feature:
1. Fixed configs : in the same way that we got a lot of stability by limiting 
the regions/server to a fixed number, we might want to similarly limit the 
number of stripes per region to 10 (or X) instead of "every Y bytes".  This 
will help us understand the benefit we get from striping, and it makes it easy 
to double the stripe count and chart the difference.
2. NameNode pressure : obviously, a 10x striping factor will cause a 10x increase 
in file count on the FS.  Can we offset this by increasing the HDFS block size, 
since addBlock dominates at scale?  Really, unlike Hadoop, you have all of the 
HFile or none of it.  Missing a portion of the HFile currently invalidates the 
whole file.  You really need 1 HDFS block == 1 HFile.  However, we could probably 
just try increasing the block size by the striping factor right now and see if 
that balances things.
3. Open Times : I think this will be an issue, specifically on server start.  
Need to be careful here.
4. Major compaction : you can perform a major compaction (remove deletes) as 
long as the stripes you compact cover a contiguous [i,end) range.  I don't think 
you'd need to involve L0 files in an MC at all.  Save the complexity.  Furthermore, 
part of the reason we created the tiered compaction was to keep small/new files 
from participating in MCs, because of cache thrashing, poor minor compactions, and 
a handful of other reasons.  (A minimal sketch of the contiguity check follows 
this list.)
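
For point 4, here is what the "contiguous" precondition could look like, assuming 
stripes are tracked as simple (startRow, endRow) byte[] pairs.  All names are 
hypothetical stand-ins, not actual HBase classes:

    import java.util.Arrays;
    import java.util.List;

    // Hypothetical stand-in: one stripe's fixed [startRow, endRow) boundaries.
    class StripeInfo {
      final byte[] startRow;  // inclusive
      final byte[] endRow;    // exclusive
      StripeInfo(byte[] startRow, byte[] endRow) {
        this.startRow = startRow;
        this.endRow = endRow;
      }
    }

    class MajorCompactionCheck {
      /**
       * Deletes can only be dropped if the selected stripes, in key order, form
       * one contiguous range: each stripe's endRow must equal the next stripe's
       * startRow.  L0 files are deliberately not considered, per the comment.
       */
      static boolean coversContiguousRange(List<StripeInfo> selectedInKeyOrder) {
        if (selectedInKeyOrder.isEmpty()) {
          return false;
        }
        for (int i = 1; i < selectedInKeyOrder.size(); i++) {
          byte[] prevEnd = selectedInKeyOrder.get(i - 1).endRow;
          byte[] nextStart = selectedInKeyOrder.get(i).startRow;
          if (!Arrays.equals(prevEnd, nextStart)) {
            return false;
          }
        }
        return true;
      }
    }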

So, some thoughts on related pain points we seem to have that tie into this 
feature:
1. Reduce cache thrashing : region moves cost us a lot of time because we come 
up with a cold cache.  There is a worry that more aggressive compactions mean more 
thrashing.  I think it will actually even this out, since right now an MC 
causes a lot of churn.  We should just keep this in mind if perf after the 
feature isn't what we desire.
2. Unnecessary IOPS : outside of this algorithm, we should just completely get 
rid of the requirement to compact after a split.  We have the block cache, so 
given a [start,end) slice of the file, we can easily tell our midpoint for future 
splits.  There's little reason to aggressively churn in this way after 
splitting.  (Rough sketch after this list.)
3. Poor locality : for grid topology setups, we should eventually make the 
striping algorithm a little more intelligent about picking our replicas.  If 
all stripes go to the same secondary & tertiary node, then splits have a very 
restricted set of servers to choose from for datanode locality.
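
For point 2, a rough sketch of picking a future split point from an in-memory 
per-block index instead of rewriting data after a split.  "blockFirstKeys" is a 
hypothetical stand-in for whatever block index the store keeps; this is not the 
real HFile API:

    import java.util.Arrays;
    import java.util.List;

    class SplitPointSketch {
      /**
       * Given the sorted first keys of a file's blocks and the [start, end)
       * slice of the file this region actually owns, pick an approximate
       * midpoint for a future split without rewriting any data.
       */
      static byte[] approxMidpoint(List<byte[]> blockFirstKeys, byte[] start, byte[] end) {
        int lo = lowerBound(blockFirstKeys, start);
        int hi = lowerBound(blockFirstKeys, end);
        if (hi - lo < 2) {
          return null; // slice spans at most one block, no useful midpoint
        }
        return blockFirstKeys.get(lo + (hi - lo) / 2);
      }

      // Index of the first key that is >= the given key, unsigned lexicographic.
      private static int lowerBound(List<byte[]> keys, byte[] key) {
        int lo = 0, hi = keys.size();
        while (lo < hi) {
          int mid = (lo + hi) >>> 1;
          if (Arrays.compareUnsigned(keys.get(mid), key) < 0) {
            lo = mid + 1;
          } else {
            hi = mid;
          }
        }
        return lo;
      }
    }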

                
> Support stripe compaction
> -------------------------
>
>                 Key: HBASE-7667
>                 URL: https://issues.apache.org/jira/browse/HBASE-7667
>             Project: HBase
>          Issue Type: New Feature
>          Components: Compaction
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>
> So I was thinking about having many regions as a way to make compactions 
> more manageable, writing the LevelDB doc about how LevelDB range overlap 
> and data mixing break seqNum sorting, discussing it with Jimmy, Matteo and 
> Ted, and thinking about how to avoid the LevelDB I/O multiplication factor.
> I suggest the following idea; let's call it stripe compactions. It's a 
> mix between LevelDB ideas and having many small regions.
> It gives us a subset of the benefits of many regions (wrt reads and 
> compactions) without many of the drawbacks (management overhead and the 
> current memstore/etc. limitations).
> It also doesn't break seqNum-based file sorting for any one key.
> It works like this.
> The region key space is separated into a configurable number of fixed-boundary 
> stripes (determined the first time we stripe the data, see below).
> All the data from memstores is written to normal files with all keys present 
> (not striped), similar to L0 in LevelDB, or to current files.
> The compaction policy does 3 types of compactions.
> The first is L0 compaction, which takes all L0 files and breaks them down by 
> stripe. It may be optimized by adding more small files from different 
> stripes, but the main logical outcome is that there are no more L0 files and 
> all data is striped.
> The second is exactly like the current compaction, but compacts one single 
> stripe. In the future, nothing prevents us from applying compaction rules and 
> compacting part of the stripe (e.g. similar to the current policy with ratios 
> and stuff, tiers, whatever), but for the first cut I'd argue we let it "major 
> compact" the entire stripe. Or just have the ratio and no more complexity.
> Finally, the third addresses the concern that the fixed boundaries could cause 
> stripes to become very unbalanced.
> It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the 
> results out with different boundaries.
> There's a tradeoff here - if we always take 2 adjacent stripes, compactions 
> will be smaller but rebalancing will take a ridiculous amount of I/O.
> If we take many stripes we are essentially getting into the 
> epic-major-compaction problem again. Some heuristics will have to be in place.
> In general, if we initially let L0 grow before determining the stripes, we 
> will get better boundaries.
> Also, unless the imbalance is really large, we don't really need to rebalance.
> Obviously this scheme (like LevelDB's) is not applicable to all scenarios; 
> e.g. if the timestamp is your key, it completely falls apart.
> The end result:
> - many small compactions that can be spread out in time.
> - reads still read from a small number of files (one stripe + L0).
> - region splits become marvelously simple (if we could move files between 
> regions, no references would be needed).
> The main advantage over LevelDB (for HBase) is that the default store can still 
> open the files and get correct results - there are no range overlap shenanigans.
> It also needs no metadata, although we may record some for convenience.
> It also appears not to cause as much I/O.
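
For illustration, the boundary-determination step described above (let L0 grow, 
then split the observed keys into stripes of roughly equal size) could look 
something like this sketch; all names are hypothetical and this is not the 
actual implementation:

    import java.util.List;

    class BoundarySketch {
      /**
       * Given row keys sampled from L0 in sorted order (roughly uniform weight
       * per sample) and a target stripe count, return stripeCount - 1 boundary
       * keys that split the samples into roughly equal-sized stripes.
       */
      static byte[][] pickBoundaries(List<byte[]> sortedSamples, int stripeCount) {
        byte[][] boundaries = new byte[stripeCount - 1][];
        for (int i = 1; i < stripeCount; i++) {
          int idx = (int) ((long) sortedSamples.size() * i / stripeCount);
          boundaries[i - 1] = sortedSamples.get(idx);
        }
        return boundaries;
      }
    }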

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
