[ 
https://issues.apache.org/jira/browse/HUDI-112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-112.
---------------------------------
    Resolution: Duplicate

> Supporting a Collapse type of operation
> ---------------------------------------
>
>                 Key: HUDI-112
>                 URL: https://issues.apache.org/jira/browse/HUDI-112
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: Common Core
>            Reporter: Nishith Agarwal
>            Assignee: Nishith Agarwal
>            Priority: Major
>
> Currently, for COPY_ON_WRITE tables, Hudi automatically grows small files by 
> packing inserts and routing them to a particular file, based on the small 
> file size limits set in the client config.
> One side effect of this is that the time taken to rewrite the small 
> files into larger ones is borne by the writer (or the ingestor). In cases 
> where we continuously want really low ingestion latency (< 5 mins), having 
> the writer enlarge the small files may not be preferable.
> It would help if the writer could schedule a collapse-style operation 
> that is later picked up asynchronously by a job/thread (different from 
> the ingestor) and collapses N files into M files, thereby also enlarging the 
> file sizes.
> The mechanism should support different strategies for scheduling a collapse, 
> so we can perform even smarter data layout during such rewriting, e.g., group 
> certain record_keys from N different files together in a single file for 
> better query performance.
> MERGE_ON_READ, on the other hand, solves this in a different way. We can send 
> inserts to log files (for a base columnar file), and when compaction kicks 
> in, it automatically resizes the file. The reader (realtime query), however, 
> has to pay a small penalty to merge the log files with the base columnar 
> files to get the freshest data.
> In any case, we need a mechanism to collapse older smaller files into larger 
> ones while also keeping the query cost low. Creating this ticket to discuss 
> more around this.
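
To make the "collapse N files into M files" idea above concrete, here is a minimal sketch of how a scheduling strategy might group small files into rewrite batches. It is an illustration only, not Hudi code: `DataFile`, `plan_collapse`, and both size parameters are hypothetical names, and first-fit-decreasing bin packing is just one possible strategy among the pluggable ones the ticket asks for.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class DataFile:
    # Hypothetical stand-in for a base file in one partition.
    path: str
    size_bytes: int


def plan_collapse(
    files: Iterable[DataFile],
    small_file_limit: int,
    target_file_size: int,
) -> List[List[DataFile]]:
    """Group small files so each group can be rewritten as one larger file.

    Uses first-fit-decreasing bin packing: the largest small files are
    placed first, each into the first group that still has room.
    """
    small = sorted(
        (f for f in files if f.size_bytes < small_file_limit),
        key=lambda f: f.size_bytes,
        reverse=True,
    )
    groups: List[List[DataFile]] = []
    totals: List[int] = []  # running size of each group, parallel to `groups`
    for f in small:
        for i, total in enumerate(totals):
            if total + f.size_bytes <= target_file_size:
                groups[i].append(f)
                totals[i] += f.size_bytes
                break
        else:
            # No existing group has room, so start a new one.
            groups.append([f])
            totals.append(f.size_bytes)
    # Rewriting a group of one file would not reduce the file count.
    return [g for g in groups if len(g) > 1]
```

The writer would only record such a plan; a separate asynchronous job would perform the rewrite, keeping that cost off the ingestion path. A strategy that co-locates record keys could replace the size-based sort with a grouping on key ranges.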



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
