[
https://issues.apache.org/jira/browse/HUDI-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bhavani Sudha updated HUDI-897:
-------------------------------
Fix Version/s: (was: 0.6.0)
0.6.1
> hudi support log append scenario with better write and asynchronous compaction
> ------------------------------------------------------------------------------
>
> Key: HUDI-897
> URL: https://issues.apache.org/jira/browse/HUDI-897
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Compaction, Performance
> Reporter: liwei
> Assignee: liwei
> Priority: Major
> Fix For: 0.6.1
>
> Attachments: image-2020-05-14-19-51-37-938.png,
> image-2020-05-14-20-14-59-429.png
>
>
> 1. Scenario
> The business scenarios of the data lake mainly include analysis of databases,
> logs, and files.
> !image-2020-05-14-20-14-59-429.png|width=444,height=286!
> Databricks Delta Lake also targets these three scenarios. [1]
>
> 2. Current state of Hudi
> At present, Hudi handles the scenario where database CDC is incrementally
> written to Hudi well, and it also supports bulk-loading files into Hudi.
> However, there is no good native support for log scenarios (which require
> high-throughput writes, have no updates or deletions, and are mainly
> concerned with small files). Today such data can be written through inserts
> without deduplication, but the data is still merged on the write side.
> * In copy-on-write mode with "hoodie.parquet.small.file.limit" set to 100MB,
> every small batch spends time on merging, which reduces write throughput.
> * This scenario is not a good fit for merge-on-read.
> * The actual scenario only needs to write Parquet files in batches when
> writing, and then provide reverse compaction (similar to Delta Lake).
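
To make the write-side cost in the first bullet concrete, here is a rough back-of-the-envelope sketch. It is a deliberately simplified model, not Hudi's actual writer: it assumes that while the newest file is below the small-file limit, each new batch is merged into it by rewriting the whole file, and that a limit of 0 means every batch becomes a new file. The function name and parameters are hypothetical.

```python
def total_mb_written(batch_mb, num_batches, small_file_limit_mb):
    """Total MB physically written under a simplified copy-on-write model.

    Assumption (not Hudi's real code path): while the newest file is smaller
    than small_file_limit_mb, the next batch is merged into it by rewriting
    the whole file. With a limit of 0, every batch is written as a new file.
    """
    total = 0
    current = 0  # size in MB of the file currently being grown
    for _ in range(num_batches):
        if 0 < current < small_file_limit_mb:
            current += batch_mb  # rewrite the old contents plus the new batch
        else:
            current = batch_mb   # start a fresh file with just this batch
        total += current
    return total


# 20 batches of 10 MB each, with and without write-side merging.
merged = total_mb_written(batch_mb=10, num_batches=20, small_file_limit_mb=100)
append_only = total_mb_written(batch_mb=10, num_batches=20, small_file_limit_mb=0)
print(merged, append_only)  # 1100 200
```

In this toy model, merging on the write side writes 1100 MB to land 200 MB of new data, which is the throughput loss the bullet describes. Real numbers depend on batch sizes, file layout, and the writer implementation.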
>
> 3. What we can do
>
> 1. On the write side, just write every batch to a Parquet file based on the
> snapshot mechanism. Merging stays enabled by default; users can disable the
> automatic merge for more write throughput.
> 2. Hudi should support asynchronously merging small Parquet files, like
> Databricks Delta Lake's OPTIMIZE command [2]
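
As an illustration of item 2, the planning step of an asynchronous small-file merge could look roughly like the sketch below. This is a generic greedy bin-packing heuristic, not the algorithm used by Hudi or by Delta Lake's OPTIMIZE; the function name, the 100 MB small-file threshold, and the 128 MB target size are assumptions chosen for the example.

```python
def plan_small_file_merge(file_sizes_mb, small_file_limit_mb=100, target_mb=128):
    """Group small files into merge tasks of at most target_mb each.

    Hypothetical helper: files at or above small_file_limit_mb are left
    untouched; the rest are packed first-fit, largest first.
    """
    small = sorted((s for s in file_sizes_mb if s < small_file_limit_mb),
                   reverse=True)
    groups = []
    for size in small:
        for group in groups:
            if sum(group) + size <= target_mb:
                group.append(size)  # fits into an existing merge task
                break
        else:
            groups.append([size])   # no task has room, open a new one
    # Rewriting a single file on its own gains nothing, so drop such groups.
    return [g for g in groups if len(g) > 1]


plan = plan_small_file_merge([10, 20, 120, 60, 50, 30])
print(plan)  # [[60, 50, 10], [30, 20]]
```

Each returned group would become one background rewrite task, so the writer can keep appending small Parquet files while the merge runs separately.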
>
> [1] [https://databricks.com/product/delta-lake-on-databricks]
> [2] [https://docs.databricks.com/delta/optimizations/file-mgmt.html]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)