joeytman commented on issue #9971:
URL: https://github.com/apache/hudi/issues/9971#issuecomment-1791317978

   By the way, I tried using `index.bootstrap.enabled=true` and the behavior 
changed somewhat, though it's still not working correctly. After enabling it 
and letting the index bootstrap complete, the Flink CDC pipeline was able to 
update those Spark-written files. However, it is still writing new files as 
well and not sticking to the bucket count that we configured (presumably new 
rows are written to new buckets, but updates to old rows are applied correctly). 
   
   E.g., we see a few versions of this file group, written first by Spark and 
then updated by the Flink pipeline:
   ```
   2023-11-02 12:52:50  479396588 00000104-c312-4946-a262-78235098e60a-0_0-4-1_20231102112623423.parquet
   2023-11-02 14:15:43  479405062 00000104-c312-4946-a262-78235098e60a-0_0-4-1_20231102133626787.parquet
   2023-11-02 17:45:06  479397481 00000104-c312-4946-a262-78235098e60a-0_0-4-1_20231102161908199.parquet
   ```
   But we also see new file groups added by Flink that do not follow the 
bucket index naming convention:
   ```
   2023-11-02 11:42:51   15930591 aa5b775e-95f8-4b91-bf8b-5a083664e6d2_1-4-1_20231102112623423.parquet
   2023-11-02 13:53:04   20403822 aa5b775e-95f8-4b91-bf8b-5a083664e6d2_1-4-1_20231102133626787.parquet
   2023-11-02 16:37:01   24544541 aa5b775e-95f8-4b91-bf8b-5a083664e6d2_1-4-1_20231102161908199.parquet
   ```
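
For anyone who wants to check their own table, here is a rough sketch of how the two naming patterns above can be told apart. It assumes base file names have the form `<fileId>_<writeToken>_<instantTime>.parquet` and that bucket-index file IDs start with the bucket number zero-padded to eight digits (so `00000104` would be bucket 104). Both points are my reading of the names in the listings, not something confirmed in this thread, and `bucket_id_or_none` is a hypothetical helper, not a Hudi API. It is only a heuristic: a random UUID could begin with eight decimal digits by chance.

```python
import re

# Assumed layout: <fileId>_<writeToken>_<instantTime>.parquet
BASE_FILE = re.compile(
    r"^(?P<file_id>[^_]+)_(?P<write_token>[^_]+)_(?P<instant>\d+)\.parquet$"
)


def bucket_id_or_none(file_name):
    """Return the bucket number if the file ID looks bucket-style, else None."""
    match = BASE_FILE.match(file_name)
    if match is None:
        return None
    prefix = match.group("file_id")[:8]
    # Assumption: bucket-style file IDs begin with eight decimal digits.
    if len(prefix) == 8 and prefix.isascii() and prefix.isdigit():
        return int(prefix)
    return None


if __name__ == "__main__":
    names = [
        "00000104-c312-4946-a262-78235098e60a-0_0-4-1_20231102112623423.parquet",
        "aa5b775e-95f8-4b91-bf8b-5a083664e6d2_1-4-1_20231102112623423.parquet",
    ]
    for name in names:
        print(bucket_id_or_none(name), name)
```

Run over a partition listing, any base file for which this returns `None` is a candidate for a file group that was not created through the bucket index.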

