dongtingting opened a new issue, #11741:
URL: https://github.com/apache/hudi/issues/11741

   
   **Describe the problem you faced**
   
A job uses bulk insert to insert overwrite a COW table. We found that 4 stages run the bulk insert write: the data is written 4 times, only the data from the last stage remains, and the data written by the other 3 stages is removed when the write is finalized.
   
   All four stages under the red line perform the bulk insert write.
   <img width="2438" alt="image" src="https://github.com/user-attachments/assets/741d1516-bcd5-4f87-8042-24bf92aa96f7">
   More details about the four stages:
   <img width="2082" alt="image" src="https://github.com/user-attachments/assets/fe6df5a5-507b-42da-9ed5-be7a4d1ba966">
   <img width="1259" alt="image" src="https://github.com/user-attachments/assets/b4973960-1a10-4ede-8e1c-0861e8f0fc83">
   
   This happens because all four stages consume the WriteStatus RDD, and `DatasetBulkInsertOverwriteCommitActionExecutor` does not persist that RDD, so every action on it re-runs the upstream RDD (the bulk insert write); in total the write runs 4 times.
   
   The four RDD actions are:
   - `DatasetBulkInsertOverwriteCommitActionExecutor.getPartitionToReplacedFileIds`: calls `isEmpty` on the WriteStatus RDD
   - `DatasetBulkInsertOverwriteCommitActionExecutor.getPartitionToReplacedFileIds`: calls `distinct` on the WriteStatus RDD
   - `HoodieSparkSqlWriter.commitAndPerformPostOperations`: calls `count` on the WriteStatus RDD
   - `HoodieSparkSqlWriter.commitAndPerformPostOperations`: calls `collect` on the WriteStatus RDD
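   Because nothing in this path is persisted, the lazy pipeline is re-evaluated once per action. The effect can be modeled without Spark or Hudi at all; in the sketch below a `Supplier` stands in for the unpersisted WriteStatus RDD (all names are hypothetical, not Hudi code):

   ```java
   import java.util.List;
   import java.util.concurrent.atomic.AtomicInteger;
   import java.util.function.Supplier;

   public class RecomputeDemo {
       // Runs the four actions from the issue against an unpersisted lazy
       // result and returns how many times the upstream "write" executed.
       static int runWithoutPersist() {
           AtomicInteger writeRuns = new AtomicInteger();
           // Like an unpersisted RDD, every get() re-runs the whole upstream job.
           Supplier<List<String>> writeStatus = () -> {
               writeRuns.incrementAndGet();                  // one full bulk insert write
               return List.of("fileA", "fileB");
           };
           writeStatus.get().isEmpty();                      // getPartitionToReplacedFileIds: isEmpty
           writeStatus.get().stream().distinct().count();    // getPartitionToReplacedFileIds: distinct
           writeStatus.get().size();                         // commitAndPerformPostOperations: count
           writeStatus.get();                                // commitAndPerformPostOperations: collect
           return writeRuns.get();
       }

       public static void main(String[] args) {
           System.out.println("write executed " + runWithoutPersist() + " times"); // 4 times
       }
   }
   ```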
   
   Upsert (which does not use bulk insert) does not have this problem, because it persists the WriteStatus RDD.
   But `BaseDatasetBulkInsertCommitActionExecutor` does not persist it.
   I think we should persist the RDD at the beginning of `BaseDatasetBulkInsertCommitActionExecutor.buildHoodieWriteMetadata`. Does anyone agree?
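   As a sketch of what persisting would buy (plain Java memoization standing in for `rdd.persist()`; hypothetical names, not Hudi code), caching the first result means the same four actions trigger only one upstream write:

   ```java
   import java.util.List;
   import java.util.concurrent.atomic.AtomicInteger;
   import java.util.function.Supplier;

   public class PersistDemo {
       // Caches the first result, the way persisting an RDD lets later
       // actions reuse an already-computed result instead of recomputing it.
       static <T> Supplier<T> memoize(Supplier<T> upstream) {
           return new Supplier<T>() {
               private T cached;
               @Override public synchronized T get() {
                   if (cached == null) cached = upstream.get();
                   return cached;
               }
           };
       }

       // Runs the same four actions and returns how many times the
       // upstream "write" actually executed.
       static int runWithPersist() {
           AtomicInteger writeRuns = new AtomicInteger();
           Supplier<List<String>> writeStatus = memoize(() -> {
               writeRuns.incrementAndGet();                  // the expensive bulk insert write
               return List.of("fileA", "fileB");
           });
           writeStatus.get().isEmpty();
           writeStatus.get().stream().distinct().count();
           writeStatus.get().size();
           writeStatus.get();
           return writeRuns.get();
       }

       public static void main(String[] args) {
           System.out.println("write executed " + runWithPersist() + " time(s)"); // 1 time(s)
       }
   }
   ```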
   
   
   <img width="2134" alt="image" src="https://github.com/user-attachments/assets/a2153f36-d0e4-4b0e-abe6-8b3ba6114578">

   <img width="2180" alt="image" src="https://github.com/user-attachments/assets/2248f178-3721-4566-9797-394f34ea7b0d">
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   1. Create a COW table `test_table` using the simple index:
   ```sql
   create table if not exists test_table
   (
       id     string,
       name   string,
       p_date string comment 'partition date, yyyyMMdd'
   ) USING hudi
   partitioned by (p_date)
   options (
       type = 'cow'
   );
   ```
   
   2. Insert overwrite the table using bulk insert:
   ```sql
   set hoodie.datasource.write.operation=BULK_INSERT;
   set hoodie.bulkinsert.shuffle.parallelism=200;

   insert overwrite test_table partition (p_date = '20240806')
   select id, name, p_date
   from source table
   ```
   3. Check the Spark task logs: the isEmpty, distinct, count, and collect stages all contain "create marker" and "create handle" log lines, i.e. each of these stages re-performed the write.
   
   **Expected behavior**
   
   The bulk insert write should execute only once: the WriteStatus RDD should be persisted so that the later isEmpty/distinct/count/collect actions reuse the computed result instead of re-running the write.
   
   **Environment Description**
   
   * Hudi version : 0.14.0
   
   * Spark version : 2.4
   
   * Hadoop version : 2.6
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
