[
https://issues.apache.org/jira/browse/HBASE-12324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183287#comment-14183287
]
Sheetal Dolas commented on HBASE-12324:
---------------------------------------
Sean, Vlad,
Thanks for your inputs.
[~vrodionov], in our case we already had all those params tuned; however, the
expired data must still get deleted. Which utility are you referring to? Can one
run it while tables are active and data is being ingested?
IMO adding external utilities is error prone and adds operational overhead, so
it would be nice if this were inside HBase. Also, as [~busbey] pointed out,
tuning these parameters needs careful evaluation and niche expertise.
It would be nice if HBase itself could take care of the complexities and make it
easy for users/operators. I can see multiple use cases, including OpenTSDB, that
need this to be handled elegantly.
Let me add some more details to the use case and proposed solution.
Use case:
* Very high ingest rate
* Immutable data
* Data life is short (a few days)
* Read rates are low to moderate (in comparison to ingest rates)
Issues with default major compaction (even when compactions are done rarely)
* A lot of data IO just to get expired data out
* No other significant benefit than expired data deletion
Proposed solution
* During major (or even minor) compactions, do not compact any data
* Just delete files whose newest timestamp is older than the TTL
* Add a new compaction policy class, say
"OnlyDeleteExpiredFilesCompactionPolicy", and set these configurations when
creating the table:
'hbase.hstore.defaultengine.compactionpolicy.class' =>
'org.apache.hadoop.hbase.regionserver.compactions.OnlyDeleteExpiredFilesCompactionPolicy',
'hbase.store.delete.expired.storefile' => 'true'
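The selection step described above can be sketched roughly as follows. This is not the attached patch and does not use HBase classes: ExpiredFileSelector, FileInfo and selectExpired are hypothetical names, and the sketch assumes each file records the newest cell timestamp it contains. It only illustrates that expired files can be picked by comparing one timestamp per file, with no data read or rewritten.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExpiredFileSelector {

    /** Hypothetical stand-in for a store file; not an HBase class. */
    static final class FileInfo {
        final String name;
        final long maxTimestampMs; // newest cell timestamp in the file (assumed known)

        FileInfo(String name, long maxTimestampMs) {
            this.name = name;
            this.maxTimestampMs = maxTimestampMs;
        }
    }

    /**
     * Returns the files whose newest cell is older than the TTL.
     * Such files can be dropped whole, without reading or rewriting their contents.
     */
    static List<FileInfo> selectExpired(List<FileInfo> files, long ttlMs, long nowMs) {
        long cutoff = nowMs - ttlMs;
        List<FileInfo> expired = new ArrayList<>();
        for (FileInfo f : files) {
            if (f.maxTimestampMs < cutoff) {
                expired.add(f);
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        long dayMs = 24L * 60 * 60 * 1000;
        long now = 10 * dayMs; // illustrative "current time"
        long ttl = 3 * dayMs;  // illustrative TTL of three days
        List<FileInfo> files = Arrays.asList(
            new FileInfo("f1", 2 * dayMs),
            new FileInfo("f2", 6 * dayMs),
            new FileInfo("f3", 9 * dayMs));
        // Only files entirely older than (now - ttl) are selected for deletion.
        for (FileInfo f : selectExpired(files, ttl, now)) {
            System.out.println("delete " + f.name);
        }
    }
}
```

Files that still hold any unexpired cell are left untouched, which is where the IO saving comes from.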
Benefits
* Significant reduction in IO during compaction
* Automatically get rid of expired data
Assumptions and applicability
* TTL is defined at the table level or for all CFs in the table
* Cells use the system timestamp for versioning or, if the timestamp is
overridden, the supplied timestamp is close to the system timestamp
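To illustrate why the timestamp assumption matters, here is a small self-contained sketch (hypothetical names, no HBase APIs): a file can only be dropped whole once its newest cell has expired, so a single cell written with a far-future timestamp keeps the entire file alive.

```java
import java.util.Arrays;
import java.util.List;

public class TimestampSkewExample {

    /** A file is droppable only when its newest cell has passed the TTL. */
    static boolean isWholeFileExpired(List<Long> cellTimestampsMs, long ttlMs, long nowMs) {
        long newest = Long.MIN_VALUE;
        for (long ts : cellTimestampsMs) {
            newest = Math.max(newest, ts);
        }
        return newest < nowMs - ttlMs;
    }

    public static void main(String[] args) {
        long dayMs = 24L * 60 * 60 * 1000;
        long now = 10 * dayMs; // illustrative "current time"
        long ttl = 3 * dayMs;  // illustrative TTL of three days

        // All cells written with timestamps close to the system clock at write time.
        List<Long> systemTimestamps = Arrays.asList(1 * dayMs, 2 * dayMs);
        // Same file, plus one cell with a client-supplied timestamp far in the future.
        List<Long> withFutureCell = Arrays.asList(1 * dayMs, 2 * dayMs, 30 * dayMs);

        System.out.println(isWholeFileExpired(systemTimestamps, ttl, now));
        System.out.println(isWholeFileExpired(withFutureCell, ttl, now));
    }
}
```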
Attached proposed compaction policy. It appears trivially simple. Thoughts?
> Improve compaction speed and process for immutable short lived datasets
> -----------------------------------------------------------------------
>
> Key: HBASE-12324
> URL: https://issues.apache.org/jira/browse/HBASE-12324
> Project: HBase
> Issue Type: New Feature
> Components: Compaction
> Affects Versions: 0.98.0, 0.96.0
> Reporter: Sheetal Dolas
>
> We have seen multiple cases where HBase is used to store immutable data and
> the data lives for a short period of time (a few days).
> On very high volume systems, major compactions become very costly and
> slow down ingestion rates.
> In all such use cases (immutable data, high write rate, moderate read
> rates and a short TTL), avoiding any compactions and just deleting old data
> brings a lot of performance benefits.
> We should have a compaction policy that only deletes/archives files older
> than the TTL and does not compact any files.
> Also attaching a patch that can do so.