xushiyan commented on issue #2850: URL: https://github.com/apache/hudi/issues/2850#issuecomment-824402773
@abhijeetkushe I think the bug is also related to [HUDI-1723](https://issues.apache.org/jira/browse/HUDI-1723), in a rarer race condition like the one you mentioned: some files are written to S3 during a given second (T1), but the last Hudi commit has just finished with its checkpoint saved as T1. The root cause is that the checkpoint consists only of time information, and S3 modification-time granularity is coarse (whole seconds).

- A long-term fix would be a different checkpointing mechanism, such as time + filename, but that would take some time to implement.
- A short-term fix is WIP in #2845, which solves the case where a source-limit cut skips files with the same modification time, but it won't solve this race condition.
- In your particular scenario, since you don't set a source limit, the problem could be resolved by changing [this line](https://github.com/apache/hudi/blob/03668dbaf1a60428d7e0d68c6622605e0809150a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DFSPathSelector.java#L165) to `status.getModificationTime() >= lastCheckpointTime`. This will read some files twice, but I think that's acceptable assuming you're upserting. We didn't make this change in #2845 because, with a relatively small source limit, it could lead to an infinite loop of re-reading the same files. Also, reading the same files more than once relies on upsert mode, or on the use case being fine with at-least-once processing.

I suggest you extend the existing path selector into a custom class and change that line of code. Then try plugging the custom class into your pipeline and see whether the files get processed. Meanwhile, we would need more input from @nsivabalan @vinothchandar on the short-/long-term plan. Thanks all.
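To illustrate why the `>` vs `>=` comparison matters with second-granularity S3 timestamps, here is a minimal self-contained sketch (not Hudi's actual `DFSPathSelector` code; the `FileStatus` record and filter method names are hypothetical stand-ins). A file written during the same second as the saved checkpoint T1 is silently dropped by the exclusive filter but picked up by the inclusive one, at the cost of re-reading files that were already committed in that second:

```java
import java.util.List;
import java.util.stream.Collectors;

public class CheckpointFilterDemo {

    // Hypothetical stand-in for an S3 file listing entry: path plus
    // modification time in epoch seconds (S3's effective granularity).
    record FileStatus(String path, long modTimeSeconds) {}

    // Exclusive filter (current behavior): drops any file whose mod time
    // equals the checkpoint second, which loses files written during T1.
    static List<FileStatus> filterExclusive(List<FileStatus> files, long lastCheckpointTime) {
        return files.stream()
                .filter(f -> f.modTimeSeconds() > lastCheckpointTime)
                .collect(Collectors.toList());
    }

    // Inclusive filter (the suggested >= change): re-reads files sharing the
    // checkpoint second, trading duplicate reads (safe under upsert /
    // at-least-once semantics) for never skipping a file.
    static List<FileStatus> filterInclusive(List<FileStatus> files, long lastCheckpointTime) {
        return files.stream()
                .filter(f -> f.modTimeSeconds() >= lastCheckpointTime)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        long checkpoint = 1_620_000_000L; // checkpoint saved as T1 (epoch seconds)
        List<FileStatus> files = List.of(
                new FileStatus("s3://bucket/a.parquet", checkpoint),       // written during T1
                new FileStatus("s3://bucket/b.parquet", checkpoint + 1));  // written after T1

        System.out.println(filterExclusive(files, checkpoint).size()); // a.parquet skipped
        System.out.println(filterInclusive(files, checkpoint).size()); // both picked up
    }
}
```

This is also why the `>=` variant is unsafe with a small source limit: if the limit cuts the batch inside a single second, the next run's checkpoint stays at that same second and the same files qualify again, so the selector can loop without making progress.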
