[ https://issues.apache.org/jira/browse/HUDI-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108792#comment-17108792 ]

liwei commented on HUDI-112:
----------------------------

Hi Nishith Agarwal [~nishith29],

We have also hit this issue, in RFC-19: 
[https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+hudi+support+log+append+scenario+with+better+write+and+asynchronous+compaction]

We need a mechanism that solves two issues:
1. On the write side: skip compaction for faster writes (MERGE_ON_READ already 
solves this).
2. Compaction and read: we also need a mechanism to collapse older, smaller 
files into larger ones while keeping the query cost low (with MERGE_ON_READ, if 
compaction does not run, realtime reads become slow).
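For issue 1, the write-side half can already be expressed through Hudi's MERGE_ON_READ write configs; a minimal sketch (the key names are real Hudi write configs, but the values here are illustrative assumptions, not a recommendation):

```properties
# Route deltas to log files and never run compaction inline on the write path
hoodie.datasource.write.table.type=MERGE_ON_READ
hoodie.compact.inline=false
# Small-file handling on the write path is governed by this limit;
# setting it to 0 stops the writer from spending time enlarging small files
hoodie.parquet.small.file.limit=0
```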

We have one option in mind:
1. On the write side: just write parquet files, with no compaction.
2. Compaction and read: because the small files are already parquet, realtime 
reads can stay fast, and users can run asynchronous compaction to collapse the 
older, smaller parquet files into larger parquet files.
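To make option 2 concrete, here is a minimal sketch of how an asynchronous collapse job could plan the grouping of N small parquet files into M larger ones. This is not Hudi code; the 120 MB target, the greedy packing strategy, and the file names are assumptions for illustration only:

```python
# Illustrative planner for collapsing N small parquet files into M larger
# ones. The target size is an assumed value, not a Hudi default.
TARGET_FILE_BYTES = 120 * 1024 * 1024

def plan_collapse(file_sizes: dict) -> list:
    """Greedily pack small files into groups whose combined size stays
    under TARGET_FILE_BYTES; each group would be rewritten as one file."""
    groups, current, current_bytes = [], [], 0
    # Largest-first packing keeps group sizes close to the target.
    for name, size in sorted(file_sizes.items(), key=lambda kv: -kv[1]):
        if current and current_bytes + size > TARGET_FILE_BYTES:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        groups.append(current)
    return groups
```

Because the planner only reads file sizes, it can run in a job or thread separate from the ingestor, which is exactly the asynchrony this issue asks for.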

What is the current progress on this issue? :)

> Supporting a Collapse type of operation
> ---------------------------------------
>
>                 Key: HUDI-112
>                 URL: https://issues.apache.org/jira/browse/HUDI-112
>             Project: Apache Hudi (incubating)
>          Issue Type: New Feature
>          Components: Common Core
>            Reporter: Nishith Agarwal
>            Assignee: Nishith Agarwal
>            Priority: Major
>
> Currently, for COPY_ON_WRITE tables, Hudi automatically adjusts small files by 
> packing inserts and sending them over to a particular file based on the small 
> file size limit set in the client config.
> One of the side effects of this is that the time taken to rewrite the small 
> files into larger ones is borne by the writer (or the ingestor). In cases 
> where we continuously want really low ingestion latency ( < 5 mins ), having 
> the writer enlarge the small files may not be preferable.
> It would be useful if the writer could schedule a collapse sort of operation 
> that is later picked up asynchronously by a job/thread (different from 
> the ingestor) and collapses N files into M files, thereby also enlarging the 
> file sizes. 
> The mechanism should support different strategies for scheduling collapse so 
> we can perform even smarter data layout during such rewriting, e.g., grouping 
> certain record_keys together in a single file from N different files to allow 
> for better query performance and more.
> MERGE_ON_READ, on the other hand, solves this in a different way. We can send 
> inserts to log files (for a base columnar file) and, when compaction kicks 
> in, it automatically resizes the file. However, the reader (realtime 
> query) pays a small penalty here to merge the log files with the 
> base columnar files to get the freshest data. 
> In any case, we need a mechanism to collapse older smaller files into larger 
> ones while also keeping the query cost low. Creating this ticket to discuss 
> more around this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
