[ https://issues.apache.org/jira/browse/HUDI-112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinoth Chandar updated HUDI-112:
--------------------------------
Status: Open (was: New)
> Supporting a Collapse type of operation
> ---------------------------------------
>
> Key: HUDI-112
> URL: https://issues.apache.org/jira/browse/HUDI-112
> Project: Apache Hudi
> Issue Type: New Feature
> Components: Common Core
> Reporter: Nishith Agarwal
> Assignee: Nishith Agarwal
> Priority: Major
>
> Currently, for COPY_ON_WRITE tables, Hudi automatically handles small files
> by packing inserts and sending them to a particular file, based on the small
> file size limits set in the client config.
> One of the side effects of this is that the time taken to rewrite the small
> files into larger ones is borne by the writer (or the ingestor). In cases
> where we continuously want really low ingestion latency (< 5 mins), having
> the writer enlarge the small files may not be preferable.
> It would help to have a way for the writer to schedule a collapse-style
> operation that can later be picked up asynchronously by a job/thread
> (different from the ingestor) that collapses N files into M files, thereby
> also enlarging the file sizes.
> The mechanism should support different strategies for scheduling collapse so
> we can perform even smarter data layout during such rewriting, e.g., group
> certain record_keys together in a single file from N different files to allow
> for better query performance and more.
> MERGE_ON_READ on the other hand solves this in a different way. We can send
> inserts to log files (for a base columnar file) and when the compaction kicks
> in, it would automatically resize the file. However, the reader (realtime
> query) would have to pay a small penalty here to merge the log files with the
> base columnar files to get the freshest data.
> In any case, we need a mechanism to collapse older smaller files into larger
> ones while also keeping the query cost low. Creating this ticket to discuss
> more around this.
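
To make the "collapse N files into M files" idea in the description concrete,
here is a minimal planning sketch. It is not Hudi code: the names
plan_collapse, CollapseGroup, small_file_limit and target_file_size are
hypothetical, and the greedy first-fit-decreasing packing is only one possible
scheduling strategy. It assumes small_file_limit <= target_file_size, so every
small file fits into some group.

```python
from dataclasses import dataclass, field


@dataclass
class CollapseGroup:
    """One output file, produced by rewriting several small input files."""
    inputs: list = field(default_factory=list)
    total_bytes: int = 0


def plan_collapse(file_sizes, small_file_limit, target_file_size):
    """Plan which small files to rewrite together (hypothetical helper).

    file_sizes maps file name -> size in bytes. Files below small_file_limit
    are packed, largest first, into the first group that stays at or under
    target_file_size. Files at or above the limit are left untouched.
    """
    small = sorted(
        ((name, size) for name, size in file_sizes.items()
         if size < small_file_limit),
        key=lambda item: (-item[1], item[0]),  # largest first, name as tiebreak
    )
    groups = []
    for name, size in small:
        for group in groups:
            if group.total_bytes + size <= target_file_size:
                group.inputs.append(name)
                group.total_bytes += size
                break
        else:
            # No existing group has room, so start a new one.
            groups.append(CollapseGroup([name], size))
    # A group with one input would rewrite a file without enlarging it.
    return [g for g in groups if len(g.inputs) > 1]


if __name__ == "__main__":
    sizes = {"f1": 30, "f2": 50, "f3": 20, "f4": 200, "f5": 40}
    for g in plan_collapse(sizes, small_file_limit=100, target_file_size=100):
        print(g.inputs, g.total_bytes)
```

A writer could persist such a plan at commit time, and a separate job could
execute it later, which keeps the rewrite cost off the ingestion path. A
layout-aware strategy (for example, grouping by record_key range) would only
need to replace the sort and packing step.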
--
This message was sent by Atlassian Jira
(v8.3.4#803005)