[
https://issues.apache.org/jira/browse/HBASE-15454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237401#comment-15237401
]
Dave Latham commented on HBASE-15454:
-------------------------------------
Sorry, I think I'm a bit behind Duo and Clara, but I'm still trying to get the
picture of this right in my head.
So it sounds like EC is independent of this work and can be noted purely as
motivation to have large, infrequently accessed files.
* What does it actually mean to "archive" a store file? Is there a definition,
or set of properties or guarantees?
** Are archived files excluded from major compaction? Or minor compactions?
Or from region split size calculation?
** Are archived files guaranteed to have no timestamp overlap with other
HFiles? Or just other archived HFiles?
** Or does it just refer to any files with max timestamp older than maxAge?
* Should archiving be a separate modality with a separate method or just happen
as part of compaction with the given window schedule?
{quote}I find the first and last files that overlapping with current archive
window, and then compact all files between them. These makes sure that all data
belongs to this window are contained in the output file.{quote}
Is that first and last file ordered by sequence id? With max timestamp in
current archive window? What if the seq id ordering and timestamp overlapping
don't match up?
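
To make the question concrete, here is a minimal sketch of one possible reading of the quoted selection rule, assuming files are ordered by sequence id and the window is half-open. `ArchiveSelection`, `FileInfo` and `select` are hypothetical names for illustration, not classes from the patch or from HBase.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration only; not taken from the HBASE-15454 patch. */
final class ArchiveSelection {

  /** Minimal stand-in for a store file: sequence id plus timestamp range. */
  static final class FileInfo {
    final long seqId;
    final long minTs;
    final long maxTs;

    FileInfo(long seqId, long minTs, long maxTs) {
      this.seqId = seqId;
      this.minTs = minTs;
      this.maxTs = maxTs;
    }

    /** True if [minTs, maxTs] intersects the half-open window [start, end). */
    boolean overlaps(long start, long end) {
      return minTs < end && maxTs >= start;
    }
  }

  /**
   * Assumes bySeqId is sorted by ascending sequence id. Returns every file
   * from the first one overlapping the window to the last one overlapping it,
   * including any files in between that do not overlap the window at all.
   */
  static List<FileInfo> select(List<FileInfo> bySeqId, long start, long end) {
    int first = -1;
    int last = -1;
    for (int i = 0; i < bySeqId.size(); i++) {
      if (bySeqId.get(i).overlaps(start, end)) {
        if (first < 0) {
          first = i;
        }
        last = i;
      }
    }
    if (first < 0) {
      return Collections.emptyList();
    }
    return new ArrayList<>(bySeqId.subList(first, last + 1));
  }
}
```

Under this reading, if the files by sequence id cover timestamps [0, 99], [200, 299] and [50, 150], then archiving the window [0, 100) also pulls in the middle file even though none of its data belongs to that window. That is the mismatch between seq id ordering and timestamp overlap I am asking about.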
I do suspect that it's most efficient to have all windows and tiers in
alignment - that if one desires calendar based files for the archive, that one
would be better off using calendar derived windows for all the data. For
example, if you want the highest tier to be calendar years, then lower tiers
could be 3-month quarters, months, weekOfMonth (some week windows would not be
full 7 days but that should be ok), days, 6-hour blocks.
Otherwise, the transition of files from one scheme to another seems likely to
require splitting existing data from a file into multiple windows. Maybe
that's OK.
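
As a rough sketch of what aligned, calendar-derived tiers could look like, the following computes the start of the window containing a timestamp for each tier, in UTC. Everything here is an assumption for illustration: `CalendarWindows` and `Tier` are hypothetical names, "weekOfMonth" is interpreted as 7-day blocks starting on day 1 of the month (so the last block covers only days 29 to 31), and `java.time` requires Java 8.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;

/** Hypothetical helper, not part of HBase: calendar-aligned window starts in UTC. */
final class CalendarWindows {

  enum Tier { YEAR, QUARTER, MONTH, WEEK_OF_MONTH, DAY, SIX_HOURS }

  /** Returns the epoch-millis start of the window of the given tier containing the timestamp. */
  static long windowStart(Tier tier, long timestampMillis) {
    ZonedDateTime t = Instant.ofEpochMilli(timestampMillis).atZone(ZoneOffset.UTC);
    ZonedDateTime midnight = t.truncatedTo(ChronoUnit.DAYS);
    ZonedDateTime start;
    switch (tier) {
      case YEAR:
        start = midnight.withDayOfYear(1);
        break;
      case QUARTER:
        // Quarters start in months 1, 4, 7 and 10.
        start = midnight.withDayOfMonth(1).withMonth(((t.getMonthValue() - 1) / 3) * 3 + 1);
        break;
      case MONTH:
        start = midnight.withDayOfMonth(1);
        break;
      case WEEK_OF_MONTH:
        // Assumed definition: blocks start on days 1, 8, 15, 22 and 29.
        start = midnight.withDayOfMonth(((t.getDayOfMonth() - 1) / 7) * 7 + 1);
        break;
      case DAY:
        start = midnight;
        break;
      case SIX_HOURS:
        start = midnight.withHour((t.getHour() / 6) * 6);
        break;
      default:
        throw new IllegalArgumentException("unknown tier: " + tier);
    }
    return start.toInstant().toEpochMilli();
  }
}
```

With this definition every finer window lies entirely inside one window of each coarser tier, so promoting files to a higher tier never requires splitting a file across windows.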
Taking a quick look over Duo's GitHub link there: I like how there is a
pluggable window factory. I think if we have that we should try to move the
window specific configuration out of the generic CompactionConfiguration into
the specific window factory. Also, I'm not sure if the intent is for
ExponentialThenCalendricalCompactionWindowFactory to be in the hbase code or
it's just there as an illustration of an alternate plugin - I tend to think it
should not be included by default.
As a side note, it seems unfortunate to add Joda-Time as a full dependency when
most people probably won't use tiered compaction, let alone calendar based
windows / archives. Perhaps using JDK classes would suffice, or even some basic
date logic written directly in the code? Or if it's just included with a window
factory plugin then only people using that would need it.
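
For instance, a yearly boundary can be computed with `java.util.Calendar` alone, which is available on JDKs older than Java 8. This is only a sketch under the assumption that windows are defined in UTC; `JdkYearWindow` and `startOfYearUtc` are hypothetical names, not existing HBase code.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

/** Hypothetical helper showing a calendar-year boundary using only JDK classes. */
final class JdkYearWindow {

  /** Returns epoch millis of January 1st, 00:00:00.000 UTC of the year containing ts. */
  static long startOfYearUtc(long ts) {
    Calendar c = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
    c.setTimeInMillis(ts);
    int year = c.get(Calendar.YEAR);
    // clear() resets every field (including milliseconds) but keeps the time zone.
    c.clear();
    c.set(year, Calendar.JANUARY, 1, 0, 0, 0);
    return c.getTimeInMillis();
  }
}
```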
> Archive store files older than max age
> --------------------------------------
>
> Key: HBASE-15454
> URL: https://issues.apache.org/jira/browse/HBASE-15454
> Project: HBase
> Issue Type: Sub-task
> Components: Compaction
> Affects Versions: 2.0.0, 1.3.0, 0.98.18, 1.4.0
> Reporter: Duo Zhang
> Assignee: Duo Zhang
> Fix For: 2.0.0, 1.3.0, 0.98.19, 1.4.0
>
> Attachments: HBASE-15454-v1.patch, HBASE-15454.patch
>
>
> Sometimes the old data is rarely touched but we can not remove it. So archive
> it to several big files(by year or something) and use EC to reduce the
> redundancy.