[
https://issues.apache.org/jira/browse/HBASE-14477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vladimir Rodionov updated HBASE-14477:
--------------------------------------
Summary: Compaction improvements: Date tiered compaction policy (was:
Compaction improvements: Generational compaction policy)
> Compaction improvements: Date tiered compaction policy
> ------------------------------------------------------
>
> Key: HBASE-14477
> URL: https://issues.apache.org/jira/browse/HBASE-14477
> Project: HBase
> Issue Type: New Feature
> Reporter: Vladimir Rodionov
> Assignee: Vladimir Rodionov
> Fix For: 2.0.0
>
>
> For immutable and mostly-immutable data the current SizeTiered-based
> compaction policy is not efficient:
> # There is no need to compact all files into one, because the data is (mostly)
> immutable and we do not need to collect garbage (the performance reasons are
> discussed below).
> # Size-tiered compaction is not suitable for applications where the most recent
> data is the most important, and it prevents efficient caching of that data.
> The idea of the generational compaction policy is quite similar to
> DateTieredCompaction in Cassandra:
> # Memstore flushes create files of Gen0.
> # Only store files of the same generation can be compacted together.
> # Once the number of files in GenK reaches N (default: 5), they get compacted
> and one file of Gen(K+1) is created.
> # Compaction stops at a predefined generation M (default: 3); see the sketch of
> the selection logic right after this list.
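>
> A minimal sketch of this selection logic in Java (not the actual HBase
> implementation; the GenStoreFile class, the generation field and the N/MAX_GEN
> constants are hypothetical names used only for illustration):
> {code:java}
> import java.util.ArrayList;
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
>
> // Hypothetical stand-in for a store file that carries a generation marker.
> final class GenStoreFile {
>   final String name;
>   final int generation;          // 0 = produced directly by a memstore flush
>   GenStoreFile(String name, int generation) {
>     this.name = name;
>     this.generation = generation;
>   }
> }
>
> public class GenerationalCompactionSketch {
>   static final int N = 5;        // files per generation before they are compacted
>   static final int MAX_GEN = 3;  // generation M: files here are not compacted again
>
>   // Returns the group of files to compact next, or an empty list if none is eligible.
>   static List<GenStoreFile> selectCompaction(List<GenStoreFile> storeFiles) {
>     Map<Integer, List<GenStoreFile>> byGen = new HashMap<>();
>     for (GenStoreFile f : storeFiles) {
>       byGen.computeIfAbsent(f.generation, g -> new ArrayList<>()).add(f);
>     }
>     // Only files of the same generation are compacted together,
>     // and compaction stops at generation MAX_GEN.
>     for (int gen = 0; gen < MAX_GEN; gen++) {
>       List<GenStoreFile> candidates = byGen.getOrDefault(gen, new ArrayList<>());
>       if (candidates.size() >= N) {
>         return candidates;       // these files become one file of generation gen + 1
>       }
>     }
>     return new ArrayList<>();
>   }
>
>   public static void main(String[] args) {
>     List<GenStoreFile> files = new ArrayList<>();
>     for (int i = 0; i < N; i++) {
>       files.add(new GenStoreFile("flush-" + i, 0));
>     }
>     System.out.println("Compacting " + selectCompaction(files).size()
>         + " Gen0 files into one Gen1 file");
>   }
> }
> {code}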
> Simple math. For the sake of simplicity, let us say that the flush size is 30MB
> and each generation holds up to 4 files before they are compacted:
> Gen0: 4*30MB = 120MB
> Gen1: 4*120MB = 480MB
> Gen2: 4*480MB = 1.92GB
> Gen3: R * 1.92GB (Gen3 is not compacted by default)
> With 3-4 files in Gen3 we get a total region size of 10-12GB, of which 10-20%
> (Gen0, Gen1 and most of Gen2) can be kept in the block cache. A worked version
> of this calculation is sketched below.
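>
> The same arithmetic as a small, self-contained Java sketch (the flush size, the
> fan-out of 4 and the number of Gen3 files are the assumptions from the example
> above, not configuration defaults):
> {code:java}
> public class GenerationSizeMath {
>   public static void main(String[] args) {
>     final double flushMb = 30.0;  // memstore flush size from the example
>     final int fanOut = 4;         // files merged into one file of the next generation
>     final int maxGen = 3;         // Gen3 files are not compacted further
>     final int gen3Files = 4;      // "R" from the example above
>
>     double fileMb = flushMb;      // size of a single file in the current generation
>     double totalMb = 0.0;
>     for (int gen = 0; gen <= maxGen; gen++) {
>       int files = (gen < maxGen) ? fanOut : gen3Files;
>       double genTotalMb = files * fileMb;
>       System.out.printf("Gen%d: %d x %.0fMB = %.0fMB%n", gen, files, fileMb, genTotalMb);
>       totalMb += genTotalMb;
>       fileMb *= fanOut;           // a Gen(K+1) file merges fanOut Gen(K) files
>     }
>     System.out.printf("Total region size ~ %.1fGB%n", totalMb / 1024);
>   }
> }
> {code}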
> Generational compaction does not limit region size: one can use 100GB or even
> more, because total compaction IO per region can be bounded and, generally
> speaking, does not depend on region size explicitly (as it does in the
> size-tiered compaction policy). Since compaction stops at generation M, each
> flushed byte is rewritten at most M times, regardless of how large the region
> grows.
> Now, about the performance implications:
> SSD-based servers will benefit from this policy because they provide more than
> adequate random IO, but even HDD-based systems can use it. Again, simple math:
> with a region size of ~10GB we will have ~16 files, of which 10-12 can be kept
> in the block cache. Even if a request touches all the files (spans the whole
> time range), it only needs to read from disk the 4-6 files that are not cached;
> see the sketch below. How to always keep the most recent data in the block
> cache is a totally separate topic (and JIRA).
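>
> A rough sketch of the cache-coverage math (the 2GB block cache budget per
> region is an assumption for illustration; file counts and per-file sizes are
> taken from the example above):
> {code:java}
> public class CacheCoverageMath {
>   public static void main(String[] args) {
>     int[] filesPerGen = {4, 4, 4, 4};              // Gen0..Gen3 for the ~10GB example region
>     double[] fileMbPerGen = {30, 120, 480, 1920};  // per-file sizes from the example
>     double cacheMb = 2048;                         // assumed block cache budget for this region
>
>     int totalFiles = 0;
>     for (int n : filesPerGen) totalFiles += n;
>
>     // Fill the cache with the youngest (smallest, most recent) generations first.
>     int cachedFiles = 0;
>     double usedMb = 0;
>     fill:
>     for (int gen = 0; gen < filesPerGen.length; gen++) {
>       for (int i = 0; i < filesPerGen[gen]; i++) {
>         if (usedMb + fileMbPerGen[gen] > cacheMb) break fill;
>         usedMb += fileMbPerGen[gen];
>         cachedFiles++;
>       }
>     }
>     System.out.printf("Cached %d of %d files (%.1fGB); a full-range read hits disk for %d files%n",
>         cachedFiles, totalFiles, usedMb / 1024, totalFiles - cachedFiles);
>   }
> }
> {code}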
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)