dongtingting opened a new issue, #11741:
URL: https://github.com/apache/hudi/issues/11741
**Describe the problem you faced**
There is a job that uses bulk insert to insert overwrite a COW table. We found that 4 stages run the bulk insert write: the data is written 4 times, only the last stage's data remains, and the data written by the other 3 stages is removed when the write is finalized.
All four stages under the red line perform the bulk insert write.
<img width="2438" alt="image"
src="https://github.com/user-attachments/assets/741d1516-bcd5-4f87-8042-24bf92aa96f7">
More details about the four stages:
<img width="2082" alt="image"
src="https://github.com/user-attachments/assets/fe6df5a5-507b-42da-9ed5-be7a4d1ba966">
<img width="1259" alt="image"
src="https://github.com/user-attachments/assets/b4973960-1a10-4ede-8e1c-0861e8f0fc83">
This happens because all four stages consume the WriteStatus RDD, and
`DatasetBulkInsertOverwriteCommitActionExecutor` does not persist that RDD,
so the upstream RDD (the bulk insert write) is recomputed 4 times.
The four actions on the WriteStatus RDD are:
- `DatasetBulkInsertOverwriteCommitActionExecutor.getPartitionToReplacedFileIds`: calls `isEmpty` on the WriteStatus RDD
- `DatasetBulkInsertOverwriteCommitActionExecutor.getPartitionToReplacedFileIds`: calls `distinct` on the WriteStatus RDD
- `HoodieSparkSqlWriter.commitAndPerformPostOperations`: calls `count` on the WriteStatus RDD
- `HoodieSparkSqlWriter.commitAndPerformPostOperations`: calls `collect` on the WriteStatus RDD
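To make the recomputation concrete, here is a minimal pure-Python sketch (not Spark; all names are hypothetical) of a lazy, unpersisted lineage: each of the four actions above re-triggers the upstream write because nothing caches its result.

```python
# Toy model of an unpersisted lazy RDD lineage. Each "action" recomputes
# the upstream bulk-insert write because no cache exists. Hypothetical
# names -- this is an illustration, not Hudi/Spark code.

class LazyWriteStatus:
    """Simulates a WriteStatus RDD whose parent is the bulk-insert write."""

    def __init__(self):
        self.write_runs = 0  # how many times the upstream write executed

    def _compute(self):
        # Without persist(), every action walks the lineage back to the
        # write stage and runs it again.
        self.write_runs += 1
        return ["status-a", "status-b"]

    # The four downstream actions observed in the issue:
    def is_empty(self):
        return len(self._compute()) == 0

    def distinct(self):
        return set(self._compute())

    def count(self):
        return len(self._compute())

    def collect(self):
        return list(self._compute())


statuses = LazyWriteStatus()
statuses.is_empty()
statuses.distinct()
statuses.count()
statuses.collect()
print(statuses.write_runs)  # 4 -- the write stage ran once per action
```

This mirrors the four Spark stages in the screenshots: one full bulk-insert write per action.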
Upsert (which does not use bulk insert) does not have this problem, because it persists the WriteStatus RDD.
But `BaseDatasetBulkInsertCommitActionExecutor` does not persist the WriteStatus RDD.
I think we should persist the RDD at the beginning of
`BaseDatasetBulkInsertCommitActionExecutor.buildHoodieWriteMetadata`. Does anyone agree?
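The effect of the proposed fix can be sketched with the same toy model (pure Python, hypothetical names): caching the computed statuses on first use plays the role of calling `persist()` on the WriteStatus RDD before the four actions, so the upstream write runs only once.

```python
# Toy model of the proposed fix: a cached lineage, analogous to calling
# persist() on the WriteStatus RDD before isEmpty/distinct/count/collect.
# Hypothetical names -- an illustration, not Hudi/Spark code.

class PersistedWriteStatus:
    """Simulates a persisted WriteStatus RDD."""

    def __init__(self):
        self.write_runs = 0
        self._cache = None  # plays the role of rdd.persist()

    def _compute(self):
        if self._cache is None:
            # The upstream bulk-insert write executes exactly once;
            # later actions hit the cache.
            self.write_runs += 1
            self._cache = ["status-a", "status-b"]
        return self._cache

    def is_empty(self):
        return len(self._compute()) == 0

    def distinct(self):
        return set(self._compute())

    def count(self):
        return len(self._compute())

    def collect(self):
        return list(self._compute())


statuses = PersistedWriteStatus()
statuses.is_empty()
statuses.distinct()
statuses.count()
statuses.collect()
print(statuses.write_runs)  # 1 -- a single bulk-insert write
```

In real code the persisted RDD would also need to be unpersisted after the commit, as the upsert path already does.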
<img width="2134" alt="image"
src="https://github.com/user-attachments/assets/a2153f36-d0e4-4b0e-abe6-8b3ba6114578">
<img width="2180" alt="image"
src="https://github.com/user-attachments/assets/2248f178-3721-4566-9797-394f34ea7b0d">
**To Reproduce**
Steps to reproduce the behavior:
1. Create a COW table `test_table`, using simple index
```sql
create table if not exists test_table (
  id string,
  name string,
  p_date string comment 'partition date, yyyyMMdd'
) using hudi
partitioned by (p_date)
options (
  type = 'cow'
);
```
2. Insert overwrite the table using bulk insert
```sql
set hoodie.datasource.write.operation=BULK_INSERT;
set hoodie.bulkinsert.shuffle.parallelism=200;

insert overwrite test_table partition (p_date = '20240806')
select id, name, p_date
from source_table;
```
3. Check the Spark job task logs: the task logs of all of the isEmpty, distinct, count, and collect stages contain "create marker" and "create handle" entries.
**Expected behavior**
The bulk insert write should run only once: the WriteStatus RDD should be persisted so that the isEmpty, distinct, count, and collect actions reuse the written data instead of re-running the write.
**Environment Description**
* Hudi version : 0.14.0
* Spark version : 2.4
* Hadoop version : 2.6
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]