[jira] [Updated] (HUDI-112) Supporting a Collapse type of operation

Vinoth Chandar (Jira) Mon, 04 Oct 2021 07:48:04 -0700


     [ 
https://issues.apache.org/jira/browse/HUDI-112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Vinoth Chandar updated HUDI-112:
--------------------------------
    Status: Open  (was: New)

> Supporting a Collapse type of operation
> ---------------------------------------
>
>                 Key: HUDI-112
>                 URL: https://issues.apache.org/jira/browse/HUDI-112
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: Common Core
>            Reporter: Nishith Agarwal
>            Assignee: Nishith Agarwal
>            Priority: Major
>
> Currently, for COPY_ON_WRITE tables, Hudi automatically grows small files by 
> packing inserts and sending them to a particular file, based on the small 
> file size limits set in the client config.
> One side effect of this is that the time taken to rewrite the small 
> files into larger ones is borne by the writer (or the ingestor). In cases 
> where we consistently need very low ingestion latency (< 5 mins), having 
> the writer enlarge the small files may not be preferable.
> It would help if the writer could instead schedule a collapse-style 
> operation that is later picked up asynchronously by a job/thread (separate 
> from the ingestor) and that collapses N files into M files, thereby also 
> enlarging the file sizes. 
> The mechanism should support different strategies for scheduling a collapse, 
> so that we can apply smarter data layout during such rewriting, e.g., group 
> certain record_keys from N different files together in a single file to 
> improve query performance.
> MERGE_ON_READ, on the other hand, solves this in a different way. We can send 
> inserts to log files (for a base columnar file), and when compaction kicks 
> in, it automatically resizes the file. The reader (realtime query), however, 
> has to pay a small penalty to merge the log files with the base columnar 
> files in order to get the freshest data. 
> In any case, we need a mechanism to collapse older smaller files into larger 
> ones while also keeping the query cost low. Creating this ticket to discuss 
> more around this.
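
To make the "collapse N files into M files" idea above concrete, here is a
minimal sketch of one possible scheduling strategy: greedily pack files below
a small-file limit into groups whose combined size stays under a target file
size. This is only an illustration, not Hudi's API or the design proposed in
this ticket; DataFile, plan_collapse, small_file_limit and target_file_size
are hypothetical names, and the sizes are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data file in one partition."""
    file_id: str
    size_bytes: int


def plan_collapse(
    files: List[DataFile],
    small_file_limit: int,
    target_file_size: int,
) -> List[List[DataFile]]:
    """Plan which small files to rewrite together (first-fit decreasing).

    Each returned group is meant to be rewritten into a single larger file
    by an asynchronous job, so the writer only has to schedule the plan.
    """
    # Only files below the small-file limit are candidates for collapsing.
    small = sorted(
        (f for f in files if f.size_bytes < small_file_limit),
        key=lambda f: f.size_bytes,
        reverse=True,
    )

    groups: List[List[DataFile]] = []
    totals: List[int] = []
    for f in small:
        # Place the file in the first group that still has room for it.
        for i, total in enumerate(totals):
            if total + f.size_bytes <= target_file_size:
                groups[i].append(f)
                totals[i] += f.size_bytes
                break
        else:
            groups.append([f])
            totals.append(f.size_bytes)

    # A group with a single file would be rewritten without getting larger.
    return [g for g in groups if len(g) > 1]


MB = 1024 * 1024
files = [
    DataFile("a", 30 * MB),
    DataFile("b", 40 * MB),
    DataFile("c", 50 * MB),
    DataFile("d", 120 * MB),  # already above the small-file limit
    DataFile("e", 10 * MB),
]
plan = plan_collapse(files, small_file_limit=100 * MB, target_file_size=128 * MB)
for group in plan:
    print([f.file_id for f in group], sum(f.size_bytes for f in group) // MB, "MB")
```

A different strategy (for example, grouping files by record_key range rather
than by size) could return groups in the same shape, which is one way the
planning step could stay pluggable while the asynchronous rewrite step stays
the same.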



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
