big-doudou commented on PR #9182: URL: https://github.com/apache/hudi/pull/9182#issuecomment-1654900164
> > How should these log files be cleaned up? Duplicate bucket id files cause tasks to fail to start all the time.
>
> The logs are expected to be cleaned when the instant is committed (we have a marker mechanism to ensure the retried files get cleaned), so the issue here is why these partial files are visible to the `BucketStreamWriter`; that's the direction we should dig into.
>
> See `BaseHoodieWriteClient#commit (line 279)`: the `finalizeWrite` would clean up the retried files based on markers.
>
> In `BucketStreamWriteFunction:line 160`, you can debug whether these intermediate logs are visible to the function view `getLatestFileSlices`. If you have an existing table to test with, that would be helpful.

Thank you for such a detailed answer. Let me test it.
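For anyone following along, the marker-based cleanup described above can be sketched roughly like this. This is a self-contained simplification, not Hudi's actual API: the `MarkerReconcile` class and the path strings are hypothetical. In Hudi the real reconciliation happens during `finalizeWrite`, which compares the marker files (one per intended write) against the files the committed write statuses actually reference, and deletes the leftovers from failed/retried attempts:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of marker-based reconciliation: every file the writer
// intends to create gets a marker first. At commit time, any marked file
// that is NOT part of the committed write statuses is a leftover from a
// retried attempt and should be deleted before the instant completes.
public class MarkerReconcile {
    public static Set<String> filesToDelete(Set<String> markedPaths,
                                            Set<String> committedPaths) {
        Set<String> leftovers = new HashSet<>(markedPaths); // all intended writes
        leftovers.removeAll(committedPaths);                // keep only uncommitted ones
        return leftovers;
    }
}
```

If partial log files are still visible to `getLatestFileSlices` after commit, that suggests this reconciliation either did not run for those files or ran after the file system view was refreshed.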
