[
https://issues.apache.org/jira/browse/HBASE-15181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15151531#comment-15151531
]
Enis Soztutar commented on HBASE-15181:
---------------------------------------
[~claraxiong] this is great work BTW. Thanks for pushing for this.
I just wanted to bring one open item back to jira to see whether ordering files
by timestamp, rather than seqId, and doing non-contiguous compactions is
acceptable:
{quote}
The tiered structure is built completely and solely on the data timestamp of
the store files. We cannot sort by seqId at all. Any logic for updates/deletes
depending on seqId would break. The user needs to guarantee updates or deletes
are in order aligned with time stamp order. This compaction policy is pluggable
and this limitation will be lifted if the work to allow compaction out of order
of seqId is done. As you pointed out in the ticket: "What I was saying offline
is that we can actually do something like HBASE-9905 and disallow
client-settable timestamps, or do something like HBASE-10247 where the table
pre-declares that we won't have same-ts edits, it should be possible to do
non-contiguous compactions."
{quote}
Given that there are currently no hard guarantees about whether the client can
do out-of-order timestamp writes, can we still always be correct, and merely
compact less efficiently if the client does an excessive amount of these
writes? Basically, if we can, I would like a system where the client gets the
full benefit automatically if the timestamps follow seqId order, but where the
results are still correct if they do not. With occasional out-of-order writes,
performance is not badly affected; with many of them, the compaction algorithm
can behave badly.
I think we can achieve this with something like this:
- Use max ts as in the design for store files.
- Instead of ordering files by decreasing ts, order files by decreasing seqId.
- Iterating from highest seqId to lowest, find the tier that the file belongs
to using maxTs. The only difference from the current algorithm is that in the
iteration, we should always assign tiers in increasing order t0, t1, t2. This
means that if out of order data is present, and we end up with flushes where
maxTs is very old, let's say it falls into t2, then t1 and t0 would be empty and
all files will be t2+. Otherwise (if you do not have out of order writes, or
have them occasionally) the behavior will be the same as in the design.
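To make the steps above concrete, here is a minimal Python sketch, not HBase
code. The names StoreFile, natural_tier, assign_tiers and the age_bounds list
are hypothetical, and the real design computes its windows differently; the
sketch only assumes that each tier covers a range of ages (now - maxTs). It
walks files from highest to lowest seqId and never lets the tier index
decrease.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class StoreFile:
    seq_id: int   # sequence id of the file
    max_ts: int   # largest cell timestamp in the file (user data)


def natural_tier(max_ts, now, age_bounds):
    """Tier implied by maxTs alone: index of the first age bound not yet reached.

    age_bounds is a hypothetical sorted list of age limits; tier k holds
    files whose age is below age_bounds[k], and the last tier is unbounded.
    """
    age = max(0, now - max_ts)
    return bisect_right(age_bounds, age)


def assign_tiers(files, now, age_bounds):
    """Assign tiers in decreasing seqId order, never moving to a younger tier."""
    assigned = []
    floor = 0  # youngest tier still allowed for the next (older seqId) file
    for f in sorted(files, key=lambda f: f.seq_id, reverse=True):
        tier = max(floor, natural_tier(f.max_ts, now, age_bounds))
        floor = tier
        assigned.append((f, tier))
    return assigned


if __name__ == "__main__":
    now, bounds = 1000, [10, 100, 500]
    in_order = [StoreFile(4, 995), StoreFile(3, 950),
                StoreFile(2, 800), StoreFile(1, 100)]
    # The newest flush carries very old data, so its maxTs falls into t2.
    out_of_order = [StoreFile(4, 700), StoreFile(3, 995),
                    StoreFile(2, 950), StoreFile(1, 100)]
    print([t for _, t in assign_tiers(in_order, now, bounds)])      # [0, 1, 2, 3]
    print([t for _, t in assign_tiers(out_of_order, now, bounds)])  # [2, 2, 2, 3]
```

In the second case t0 and t1 stay empty and every file lands in t2 or later,
which matches the degraded but still seqId-contiguous behavior described above.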
Alternatively, HFiles also have CREATE_TIME_TS, which is different from
maxTimestamp. maxTs comes from the user data, while the hfile create time is
the system time at the time of hfile writing. If we do the tier selection based
on hfile create time instead of the user's maxTs, then we might not even have
that problem.
Again, if there is actual correlation of user's timestamps with the seqIds (or
hfile create times), you would get all the benefits, otherwise, we would still
return the correct results, but compaction may not be optimal (I think it will
be like falling back to ExploringCompactionPolicy). Anyway, just a suggestion
to consider.
I might not have thought of all corner cases.
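A small Python sketch of the create-time alternative, again with hypothetical
names (HFileInfo, tiers_by, age_bounds) and made-up numbers. It assumes only
freshly flushed files, whose create times follow seqId order; compaction
outputs and bulk-loaded files are not modeled and may well be among the corner
cases. It compares the tier sequence obtained from maxTs with the one obtained
from the create time when the newest flush contains old data.

```python
from bisect import bisect_right
from collections import namedtuple

# Hypothetical per-file metadata: seqId, max user timestamp, hfile create time.
HFileInfo = namedtuple("HFileInfo", "seq_id max_ts create_ts")


def tiers_by(files, now, age_bounds, key):
    """Tier of each file, in decreasing seqId order, using the timestamp `key`."""
    ordered = sorted(files, key=lambda f: f.seq_id, reverse=True)
    return [bisect_right(age_bounds, max(0, now - getattr(f, key)))
            for f in ordered]


def is_non_decreasing(values):
    """True when tiers never step back to a younger tier along seqId order."""
    return all(a <= b for a, b in zip(values, values[1:]))


if __name__ == "__main__":
    now, bounds = 1000, [10, 100, 500]
    files = [
        HFileInfo(seq_id=3, max_ts=700, create_ts=995),  # backfilled old data
        HFileInfo(seq_id=2, max_ts=995, create_ts=950),
        HFileInfo(seq_id=1, max_ts=940, create_ts=800),
    ]
    by_max_ts = tiers_by(files, now, bounds, "max_ts")
    by_create = tiers_by(files, now, bounds, "create_ts")
    print(by_max_ts, is_non_decreasing(by_max_ts))  # [2, 0, 1] False
    print(by_create, is_non_decreasing(by_create))  # [0, 1, 2] True
```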
You are saying that this patch is also in production. Are there any numbers
you've collected?
> A simple implementation of date based tiered compaction
> -------------------------------------------------------
>
> Key: HBASE-15181
> URL: https://issues.apache.org/jira/browse/HBASE-15181
> Project: HBase
> Issue Type: New Feature
> Components: Compaction
> Reporter: Clara Xiong
> Assignee: Clara Xiong
> Fix For: 2.0.0, 1.3.0, 0.98.19
>
> Attachments: HBASE-15181-v1.patch, HBASE-15181-v2.patch
>
>
> This is a simple implementation of date-based tiered compaction similar to
> Cassandra's for the following benefits:
> 1. Improve date-range-based scan by structuring store files in date-based
> tiered layout.
> 2. Reduce compaction overhead.
> 3. Improve TTL efficiency.
> Perfect fit for use cases that:
> 1. have mostly date-based data writes and scans and a focus on the most
> recent data.
> 2. never or rarely delete data.
> Out-of-order writes are handled gracefully so the data will still get to the
> right store file for time-range scans, and re-compaction with existing store
> files in the same time window is handled by ExploringCompactionPolicy.
> Time range overlapping among store files is tolerated and the performance
> impact is minimized.
> Configuration can be set in hbase-site or overridden at the per-table or
> per-column-family level via the hbase shell.
> Design spec is at
> https://docs.google.com/document/d/1_AmlNb2N8Us1xICsTeGDLKIqL6T-oHoRLZ323MG_uy8/edit?usp=sharing