xushiyan commented on issue #2850: URL: https://github.com/apache/hudi/issues/2850#issuecomment-824402773
@abhijeetkushe I think the bug is also related to [HUDI-1723](https://issues.apache.org/jira/browse/HUDI-1723), in a rarer race condition like the one you mentioned: some files are written to S3 during a given second (T1), but the last Hudi commit has just finished with its checkpoint saved as T1. The root cause is that the checkpoint consists only of time information, and S3 modification-time granularity is coarse (whole seconds).

- A long-term fix would be a different checkpointing mechanism, such as time + filename, but that would take some time to implement.
- A short-term fix is WIP in #2845, which solves the case where a source-limit cut skips files with the same modification time, but it won't solve this race condition.
- In your particular scenario, since you don't set a source limit, the problem could be resolved by changing [this line](https://github.com/apache/hudi/blob/03668dbaf1a60428d7e0d68c6622605e0809150a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DFSPathSelector.java#L165) to `status.getModificationTime() >= lastCheckpointTime`. This will read some files twice, but I think that's acceptable assuming you're upserting. We didn't make this change in #2845 because, with a relatively small source limit, it could lead to an infinite loop of re-reading the same files. Also, reading the same files more than once relies on upsert mode, or on the use case being fine with at-least-once processing.

I suggest you extend the existing path selector into a custom class and change that line of code. Then try plugging the custom class into your pipeline and see whether the files get processed. Meanwhile, we would need more input from @nsivabalan @vinothchandar on the short-/long-term plan. Thanks all.
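To illustrate why the `>` vs `>=` comparison matters with second-granularity S3 timestamps, here is a minimal self-contained sketch (not Hudi's actual `DFSPathSelector` code; the `FileStatus` record and filter method names are hypothetical stand-ins). A file written during the same second as the saved checkpoint T1 is silently dropped by the exclusive filter but picked up by the inclusive one, at the cost of re-reading files that were already committed in that second:

```java
import java.util.List;
import java.util.stream.Collectors;

public class CheckpointFilterDemo {

    // Hypothetical stand-in for an S3 file listing entry: path plus
    // modification time in epoch seconds (S3's effective granularity).
    record FileStatus(String path, long modTimeSeconds) {}

    // Exclusive filter (current behavior): drops any file whose mod time
    // equals the checkpoint second, which loses files written during T1.
    static List<FileStatus> filterExclusive(List<FileStatus> files, long lastCheckpointTime) {
        return files.stream()
                .filter(f -> f.modTimeSeconds() > lastCheckpointTime)
                .collect(Collectors.toList());
    }

    // Inclusive filter (the suggested >= change): re-reads files sharing the
    // checkpoint second, trading duplicate reads (safe under upsert /
    // at-least-once semantics) for never skipping a file.
    static List<FileStatus> filterInclusive(List<FileStatus> files, long lastCheckpointTime) {
        return files.stream()
                .filter(f -> f.modTimeSeconds() >= lastCheckpointTime)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        long checkpoint = 1_620_000_000L; // checkpoint saved as T1 (epoch seconds)
        List<FileStatus> files = List.of(
                new FileStatus("s3://bucket/a.parquet", checkpoint),       // written during T1
                new FileStatus("s3://bucket/b.parquet", checkpoint + 1));  // written after T1

        System.out.println(filterExclusive(files, checkpoint).size()); // a.parquet skipped
        System.out.println(filterInclusive(files, checkpoint).size()); // both picked up
    }
}
```

This is also why the `>=` variant is unsafe with a small source limit: if the limit cuts the batch inside a single second, the next run's checkpoint stays at that same second and the same files qualify again, so the selector can loop without making progress.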
