[ 
https://issues.apache.org/jira/browse/HBASE-15454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237401#comment-15237401
 ] 

Dave Latham commented on HBASE-15454:
-------------------------------------

Sorry, I think I'm a bit behind Duo and Clara, but I'm still trying to get the 
picture of this right in my head.

So it sounds like EC is independent of this work and can be noted purely as 
motivation to have large, infrequently accessed files.

* What does it actually mean to "archive" a store file?  Is there a definition, 
or set of properties or guarantees?
** Are archived files excluded from major compaction?  Or minor compactions?  
Or from region split size calculation?
** Are archived files guaranteed to have no timestamp overlap with other 
HFiles?  Or just other archived HFiles?
** Or does it just refer to any files with max timestamp older than maxAge?
* Should archiving be a separate modality with a separate method or just happen 
as part of compaction with the given window schedule?

{quote}I find the first and last files that overlapping with current archive 
window, and then compact all files between them. These makes sure that all data 
belongs to this window are contained in the output file.{quote}

Are the first and last files picked in sequence id order?  And does 
"overlapping" mean a file's max timestamp falls in the current archive window?  
What if the seq id ordering and the timestamp overlap don't match up?
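
To make the concern concrete, here is a minimal sketch of how I read the quoted 
selection rule.  FileInfo and ArchiveSelection are made-up names for 
illustration, not HBase classes, and I'm assuming "overlapping" means the 
file's [minTs, maxTs] range intersects the window [windowStart, windowEnd).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a store file; not an HBase class.
final class FileInfo {
    final long seqId;
    final long minTs;
    final long maxTs;

    FileInfo(long seqId, long minTs, long maxTs) {
        this.seqId = seqId;
        this.minTs = minTs;
        this.maxTs = maxTs;
    }

    // Assumption: a file overlaps the window [windowStart, windowEnd)
    // when its timestamp range intersects it.
    boolean overlaps(long windowStart, long windowEnd) {
        return minTs < windowEnd && maxTs >= windowStart;
    }
}

final class ArchiveSelection {
    // Sort by sequence id, find the first and last overlapping files,
    // and return everything between them (inclusive).
    static List<FileInfo> select(List<FileInfo> files, long windowStart, long windowEnd) {
        List<FileInfo> sorted = new ArrayList<>(files);
        sorted.sort((a, b) -> Long.compare(a.seqId, b.seqId));
        int first = -1;
        int last = -1;
        for (int i = 0; i < sorted.size(); i++) {
            if (sorted.get(i).overlaps(windowStart, windowEnd)) {
                if (first < 0) {
                    first = i;
                }
                last = i;
            }
        }
        if (first < 0) {
            return new ArrayList<>();
        }
        return new ArrayList<>(sorted.subList(first, last + 1));
    }
}
```

With files (seqId 1, ts 100-199), (seqId 2, ts 300-399) and (seqId 3, ts 
150-180) and a window of [100, 200), this selects all three files, including 
the middle one that holds no data for the window.  That is the mismatch case I 
am asking about.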

I do suspect that it's most efficient to have all windows and tiers in 
alignment - that if one desires calendar based files for the archive, that one 
would be better off using calendar derived windows for all the data.  For 
example, if you want the highest tier to be calendar years, then lower tiers 
could be 3-month quarters, months, weekOfMonth (some week windows would not be 
full 7 days but that should be ok), days, 6-hour blocks.  
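
As a rough sketch of what aligned calendar windows could look like, the 
following computes the start of the window containing a timestamp for each 
tier.  The Tier and CalendarWindows names are hypothetical, it assumes UTC, it 
assumes Java 8's java.time is available, and it treats weekOfMonth as days 
1-7, 8-14, 15-21, 22-28 and a short final week, which is only one possible 
reading.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;

// Hypothetical tiers, highest first; not part of HBase.
enum Tier { YEAR, QUARTER, MONTH, WEEK_OF_MONTH, DAY, SIX_HOURS }

final class CalendarWindows {
    // Returns the start (epoch millis, UTC) of the calendar window that
    // contains tsMillis for the given tier.
    static long windowStart(long tsMillis, Tier tier) {
        ZonedDateTime t = Instant.ofEpochMilli(tsMillis).atZone(ZoneOffset.UTC);
        ZonedDateTime day = t.truncatedTo(ChronoUnit.DAYS);
        ZonedDateTime start;
        switch (tier) {
            case YEAR:
                start = day.withDayOfYear(1);
                break;
            case QUARTER: {
                // Quarters start in months 1, 4, 7 and 10.
                int firstMonth = ((t.getMonthValue() - 1) / 3) * 3 + 1;
                start = day.withDayOfMonth(1).withMonth(firstMonth);
                break;
            }
            case MONTH:
                start = day.withDayOfMonth(1);
                break;
            case WEEK_OF_MONTH: {
                // Assumed weeks: days 1-7, 8-14, 15-21, 22-28, 29-end.
                int firstDay = ((t.getDayOfMonth() - 1) / 7) * 7 + 1;
                start = day.withDayOfMonth(firstDay);
                break;
            }
            case DAY:
                start = day;
                break;
            case SIX_HOURS:
                start = day.plusHours((t.getHour() / 6) * 6);
                break;
            default:
                throw new IllegalArgumentException("unknown tier: " + tier);
        }
        return start.toInstant().toEpochMilli();
    }
}
```

Because every tier boundary here is also a boundary of the tier below it, a 
lower-tier window never straddles two higher-tier windows, so promoting files 
to the next tier would not require splitting them.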

Otherwise, the transition of files from one scheme to another seems likely to 
require splitting existing data from a file into multiple windows.  Maybe 
that's OK.

Taking a quick look over Duo's GitHub link there: I like that there is a 
pluggable window factory.  If we have that, I think we should try to move the 
window-specific configuration out of the generic CompactionConfiguration and 
into the specific window factory.  Also, I'm not sure whether the intent is 
for ExponentialThenCalendricalCompactionWindowFactory to live in the HBase 
code or whether it's just there as an illustration of an alternate plugin - I 
tend to think it should not be included by default.
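
To illustrate what I mean by the factory owning its configuration, here is a 
small sketch.  The interface, the class and the "sketch.window.millis" key are 
all invented for this example and are not taken from Duo's branch or from the 
HBase code; a plain Map stands in for whatever configuration object would 
really be passed.

```java
import java.util.Map;

// Hypothetical plugin interface; names are not from the HBase code base.
interface WindowFactorySketch {
    // Each factory reads only the settings it cares about.
    void configure(Map<String, String> conf);

    // Start (epoch millis) of the window containing tsMillis.
    long windowStart(long tsMillis);
}

final class FixedSizeWindowFactorySketch implements WindowFactorySketch {
    // Made-up key, owned by this factory rather than by a generic config class.
    static final String WINDOW_MILLIS_KEY = "sketch.window.millis";

    private long windowMillis = 3_600_000L; // default: one hour

    @Override
    public void configure(Map<String, String> conf) {
        String value = conf.get(WINDOW_MILLIS_KEY);
        if (value != null) {
            windowMillis = Long.parseLong(value);
        }
    }

    @Override
    public long windowStart(long tsMillis) {
        return Math.floorDiv(tsMillis, windowMillis) * windowMillis;
    }
}
```

The generic compaction configuration would then only need to know which 
factory class to instantiate, and a calendar-based factory could define its 
own keys without touching the shared class.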

As a side note, it seems unfortunate to add Joda-Time as a full dependency when 
most people probably won't use tiered compaction, let alone calendar-based 
windows / archives.  Perhaps JDK classes would suffice, or even some basic 
logic written directly in the code?  Or if it's only included with a window 
factory plugin, then only people using that plugin would need it.

> Archive store files older than max age
> --------------------------------------
>
>                 Key: HBASE-15454
>                 URL: https://issues.apache.org/jira/browse/HBASE-15454
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Compaction
>    Affects Versions: 2.0.0, 1.3.0, 0.98.18, 1.4.0
>            Reporter: Duo Zhang
>            Assignee: Duo Zhang
>             Fix For: 2.0.0, 1.3.0, 0.98.19, 1.4.0
>
>         Attachments: HBASE-15454-v1.patch, HBASE-15454.patch
>
>
> Sometimes the old data is rarely touched but we can not remove it. So archive 
> it to several big files(by year or something) and use EC to reduce the 
> redundancy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
