[
https://issues.apache.org/jira/browse/HUDI-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
leesf updated HUDI-897:
-----------------------
Description:
I. Scenario
The business scenarios for a data lake mainly involve analyzing data from
databases, logs, and files.
!image-2020-05-14-20-14-59-429.png|width=444,height=286!
Databricks Delta Lake also targets these three scenarios. [1]
II. Hudi's current situation
At present, Hudi supports the database CDC scenario well, where changes are
incrementally written into Hudi, and work on bulk-loading files into Hudi is
also underway. However, there is no good native support for the log scenario
(high-throughput writes, no updates or deletes, and a focus on small files).
Logs can be written today through inserts without deduplication, but small
files are still merged on the write side:
* In copy-on-write mode, with "hoodie.parquet.small.file.limit" at its 100MB
default, every small batch spends time merging into existing small files,
which reduces write throughput.
* Merge-on-read is not a good fit for this scenario either.
* The scenario really only needs each batch written straight to parquet, with
compaction of the small files provided afterwards (similar to Delta Lake). A
sketch of today's write path follows this list.
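For concreteness, a minimal sketch of what log-append ingestion looks like
today through the Spark datasource. This assumes Spark with the hudi-spark
bundle on the classpath; the table path and field names (uuid, ts) are
illustrative, while the config keys are Hudi's existing ones:
{code:scala}
import org.apache.spark.sql.SaveMode

val df = spark.read.json("/data/raw/logs")  // incoming log batch (illustrative)

// Append-only ingestion on a copy-on-write table. With the default 100MB
// small-file limit, each insert batch is bin-packed into existing small
// files, i.e. merged on the write side, which costs write throughput.
df.write.format("hudi")
  .option("hoodie.table.name", "logs")
  .option("hoodie.datasource.write.operation", "insert")    // inserts, not upserts
  .option("hoodie.combine.before.insert", "false")          // skip deduplication
  .option("hoodie.parquet.small.file.limit", "104857600")   // default: 100MB
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode(SaveMode.Append)
  .save("/data/hudi/logs")
{code}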
III. What we can do
1. On the write side, just write every batch to new parquet files based on the
snapshot mechanism. Keep merging on by default; users can turn auto-merge off
for higher write throughput.
2. Hudi should support asynchronously merging small parquet files, like
Databricks Delta Lake's OPTIMIZE command [2]. A sketch of the intended usage
follows below.
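To make the proposal concrete, a sketch of the intended usage. Setting
hoodie.parquet.small.file.limit to 0 already disables write-side small-file
handling today; the asynchronous small-file merge at the end is hypothetical
and does not exist in Hudi yet:
{code:scala}
import org.apache.spark.sql.SaveMode

// Proposed write path: every batch lands in new parquet files, so the
// writer pays no merge cost. A small-file limit of 0 turns off Hudi's
// write-side small-file handling.
df.write.format("hudi")
  .option("hoodie.table.name", "logs")
  .option("hoodie.datasource.write.operation", "insert")
  .option("hoodie.combine.before.insert", "false")
  .option("hoodie.parquet.small.file.limit", "0")  // no merge on the write path
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode(SaveMode.Append)
  .save("/data/hudi/logs")

// Hypothetical follow-up, mirroring Delta Lake's OPTIMIZE [2] -- not an
// existing Hudi command: an asynchronous service that rewrites the small
// parquet files into larger ones without blocking ingestion, e.g.
//   hudi-cli> optimize --path /data/hudi/logs --target-file-size 128MB
{code}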
[1] [https://databricks.com/product/delta-lake-on-databricks]
[2] [https://docs.databricks.com/delta/optimizations/file-mgmt.html]
> hudi support log append scenario with better write and asynchronous compaction
> ------------------------------------------------------------------------------
>
> Key: HUDI-897
> URL: https://issues.apache.org/jira/browse/HUDI-897
> Project: Apache Hudi (incubating)
> Issue Type: Bug
> Components: Compaction, Performance
> Reporter: liwei
> Assignee: liwei
> Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-05-14-19-51-37-938.png,
> image-2020-05-14-20-14-59-429.png
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)