joeytman commented on issue #9971: URL: https://github.com/apache/hudi/issues/9971#issuecomment-1791317978
By the way, I tried setting `index.bootstrap.enabled=true`, and the behavior changed somewhat, though it's still not working correctly. After enabling it and letting the index bootstrap complete, the Flink CDC pipeline was able to update the Spark-written files. However, it's still writing new files as well and not sticking to the bucket count that we configured (presumably new rows are written to new buckets, while updates to old rows are applied correctly).

E.g., we see a few versions of this file group, written first by Spark and then updated by the Flink pipeline:

```
2023-11-02 12:52:50 479396588 00000104-c312-4946-a262-78235098e60a-0_0-4-1_20231102112623423.parquet
2023-11-02 14:15:43 479405062 00000104-c312-4946-a262-78235098e60a-0_0-4-1_20231102133626787.parquet
2023-11-02 17:45:06 479397481 00000104-c312-4946-a262-78235098e60a-0_0-4-1_20231102161908199.parquet
```

But we also see new file groups added by Flink that do not follow the bucket index naming convention:

```
2023-11-02 11:42:51 15930591 aa5b775e-95f8-4b91-bf8b-5a083664e6d2_1-4-1_20231102112623423.parquet
2023-11-02 13:53:04 20403822 aa5b775e-95f8-4b91-bf8b-5a083664e6d2_1-4-1_20231102133626787.parquet
2023-11-02 16:37:01 24544541 aa5b775e-95f8-4b91-bf8b-5a083664e6d2_1-4-1_20231102161908199.parquet
```
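
For anyone wanting to check a partition for the same symptom, here is a rough sketch of how the two kinds of file names above could be told apart. It assumes that bucket-index file IDs start with an 8-digit, zero-padded bucket number (as in `00000104-...`), which is inferred from the listings above rather than taken from a spec. The helper names `bucket_id` and `split_by_convention` are made up for illustration.

```python
import re

# Assumption: a bucket-index file ID looks like a UUID whose first 8 characters
# are the zero-padded bucket number, e.g. 00000104-c312-4946-a262-78235098e60a.
# A random UUID could by chance start with 8 digits, so treat a match as a
# strong hint, not proof.
BUCKET_FILE_ID = re.compile(
    r"^(\d{8})-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def bucket_id(file_name):
    """Return the bucket number encoded in the file name, or None."""
    match = BUCKET_FILE_ID.match(file_name)
    return int(match.group(1)) if match else None


def split_by_convention(file_names):
    """Group names into {bucket number: [names]} and a list of non-conforming names."""
    bucketed, other = {}, []
    for name in file_names:
        bucket = bucket_id(name)
        if bucket is None:
            other.append(name)
        else:
            bucketed.setdefault(bucket, []).append(name)
    return bucketed, other


if __name__ == "__main__":
    names = [
        "00000104-c312-4946-a262-78235098e60a-0_0-4-1_20231102112623423.parquet",
        "aa5b775e-95f8-4b91-bf8b-5a083664e6d2_1-4-1_20231102112623423.parquet",
    ]
    bucketed, other = split_by_convention(names)
    print("bucket-style:", bucketed)
    print("non-conforming:", other)
```

Feeding it the file names from a partition listing (the last column of the output above) should put the Spark-written file group under bucket 104 and the `aa5b775e-...` file group in the non-conforming list.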
