[jira] [Commented] (HBASE-15400) Using Multiple Output for Date Tiered Compaction

Duo Zhang (JIRA) Sat, 05 Mar 2016 20:03:02 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-15400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181990#comment-15181990
 ]


Duo Zhang commented on HBASE-15400:
-----------------------------------

{quote}
4. Since we will assign the same seq id for all output files, we need to sort 
by maxTimestamp subsequently. Right now all compaction policy gets the files 
sorted for StoreFileManager which sorts by seq id and other criteria. I will 
use this order for DTCP only, to avoid impacting other compaction policies.
{quote}
If we only write to all windows in major compaction and only write out two 
files in minor compaction, I think we could assign different seqIds for the 
output files?
For major compaction, let 'seqId' be the max seqIds of all files, then we could 
use seqId, seqId - 1, seqId - 2... for the output files. And for minor 
compaction, at least we could have two input files, so just reuse the seqIds of 
these files is enough?

{quote}
3. We have to pass the boundaries with the list of store file as a complete 
time snapshot instead of two separate calls because window layout is determined 
by the time the computation is called. So we will need new type of compaction 
request.
{quote}
What about store the new fields in CompactionContext? Especially, store 'now' 
in compaction context and use it in subsequent calls?

Thanks.

> Using Multiple Output for Date Tiered Compaction
> ------------------------------------------------
>
>                 Key: HBASE-15400
>                 URL: https://issues.apache.org/jira/browse/HBASE-15400
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Compaction
>            Reporter: Clara Xiong
>            Assignee: Clara Xiong
>             Fix For: 2.0.0
>
>
> When we compact, we can output multiple files along the current window 
> boundaries. There are two use cases:
> 1. Major compaction: We want to output date tiered store files.
> 2. Bulk load files and the old file generated by major compaction before 
> upgrading to DTCP.
> Pros: 
> 1. Restore locality, process versioning, updates and deletes while 
> maintaining the tiered layout.
> 2. The best way to fix a skewed layout.
>  
> I am starting from a prototype of date tiered file writer from HBASE-15389 
> and will upload a patch soon. I have to call out a few design decisions:
> 1. We only want to output the files along all windows for major compaction. 
> 2. For minor compaction, we don't want to output too many files, which will 
> remain around because of current restriction of contiguous compaction by seq 
> id. I will only output two files if all the files in the windows are being 
> combined, one for the data within window and the other for the out-of-window 
> tail. If there is any file in the window excluded from compaction, only one 
> file will be output from compaction. When the windows are promoted, the 
> situation of out of order data will gradually improve.
> 3. We have to pass the boundaries with the list of store file as a complete 
> time snapshot instead of two separate calls because window layout is 
> determined by the time the computation is called. So we will need new type of 
> compaction request. 
> 4. Since we will assign the same seq id for all output files, we need to sort 
> by maxTimestamp subsequently. Right now all compaction policy gets the files 
> sorted for StoreFileManager which sorts by seq id and other criteria. I will 
> use this order for DTCP only, to avoid impacting other compaction policies. 
> 5. We need some cleanup of current design of StoreEngine and CompactionPolicy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HBASE-15400) Using Multiple Output for Date Tiered Compaction

Reply via email to