[
https://issues.apache.org/jira/browse/HUDI-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377084#comment-17377084
]
Yue Zhang commented on HUDI-2144:
---------------------------------
I just raised a PR that tries to fix this problem.
> Offline clustering(independent sparkJob) will cause insert action losing data
> -----------------------------------------------------------------------------
>
> Key: HUDI-2144
> URL: https://issues.apache.org/jira/browse/HUDI-2144
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Yue Zhang
> Priority: Major
> Attachments: image-2021-07-08-13-52-00-089.png
>
>
> We currently run two kinds of Spark pipelines for Hudi:
> # A streaming job that inserts data into a specific partition
> # An offline clustering Spark job
> (`org.apache.hudi.utilities.HoodieClusteringJob`) that optimizes the size of
> the files pipeline 1 created
> However, we hit a bug that causes data loss.
> The following steps reproduce the problem reliably:
> # Submit a Spark job to ingest data1 using insert mode.
> # Schedule a clustering plan using
> `org.apache.hudi.utilities.HoodieClusteringJob`.
> # Submit another Spark job to ingest data2 using insert mode. (Ensure that a
> new file slice is created in the same file group, which means small-file
> tuning for insert is working.) Call this file group A and the new
> file slice a.
> # Execute the clustering plan scheduled in step 2.
> # Query data1+data2. The new data in a is lost compared with
> normal ingestion without clustering.
>
> !image-2021-07-08-13-52-00-089.png!
> Here is the root cause:
> When ingesting data in insert mode, Hudi finds small files and tries to
> append new data to them, aiming to tune the data file size.
> [https://github.com/apache/hudi/blob/650c4455c600b0346fed8b5b6aa4cc0bf3452e8c/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java#L149]
> tries to filter out small files that are in clustering, but the filter only
> applies when the user sets `hoodie.clustering.inline` to true, which is not
> sufficient for users running offline clustering.
> I have raised a PR that tries to fix this, and I have tested it.
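
To make the reproduction steps and the root cause concrete, here is a minimal toy simulation in Python. It is only a sketch and does not use the Hudi API: `FileGroup`, `ToyTable`, `exclude_pending_clustering` and `run` are hypothetical names invented for illustration. It assumes that executing a clustering plan rewrites only the file slices that existed when the plan was scheduled and then replaces the whole file group, which is one reading of why slice a disappears.

```python
from dataclasses import dataclass, field


@dataclass
class FileGroup:
    """Toy stand-in for a file group: an id plus its file slices."""
    group_id: str
    slices: list = field(default_factory=list)  # each slice is a list of records

    def records(self):
        return [r for s in self.slices for r in s]


class ToyTable:
    """Hypothetical model of a table; this is not the Hudi API."""

    def __init__(self, exclude_pending_clustering):
        self.groups = {}
        self.pending_plan = None  # maps group id -> slice count captured by the plan
        self.exclude_pending_clustering = exclude_pending_clustering
        self._next_id = 0

    def _new_group(self, records):
        gid = f"fg-{self._next_id}"
        self._next_id += 1
        self.groups[gid] = FileGroup(gid, [list(records)])

    def insert(self, records):
        # Small-file tuning: every existing group counts as "small" in this toy.
        candidates = list(self.groups)
        if self.exclude_pending_clustering and self.pending_plan:
            # The filter the issue asks for: skip groups in a pending plan,
            # regardless of whether clustering is inline or offline.
            candidates = [g for g in candidates if g not in self.pending_plan]
        if candidates:
            self.groups[candidates[0]].slices.append(list(records))
        else:
            self._new_group(records)

    def schedule_clustering(self):
        # The plan captures the slices that exist right now.
        self.pending_plan = {gid: len(g.slices) for gid, g in self.groups.items()}

    def execute_clustering(self):
        merged = []
        for gid, slice_count in self.pending_plan.items():
            group = self.groups.pop(gid)  # the whole group is replaced
            for s in group.slices[:slice_count]:  # only planned slices are rewritten
                merged.extend(s)
        self._new_group(merged)
        self.pending_plan = None

    def read_all(self):
        return [r for g in self.groups.values() for r in g.records()]


def run(exclude_pending_clustering):
    table = ToyTable(exclude_pending_clustering)
    table.insert(["d1-a", "d1-b"])   # step 1: ingest data1
    table.schedule_clustering()      # step 2: schedule the plan
    table.insert(["d2-a"])           # step 3: ingest data2
    table.execute_clustering()       # step 4: execute the plan
    return sorted(table.read_all())  # step 5: query


print(run(exclude_pending_clustering=False))  # data2 is missing
print(run(exclude_pending_clustering=True))   # data1 and data2 both present
```

With the filter off, the second insert lands in the file group that the pending plan later replaces, so its records vanish. With the filter on, the insert goes to a fresh file group and survives the clustering run.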
--
This message was sent by Atlassian Jira
(v8.3.4#803005)