abhijeetkushe commented on issue #2850: URL: https://github.com/apache/hudi/issues/2850#issuecomment-824917902
@xushiyan Thanks for your prompt reply. I agree that the issue I am facing is related to [HUDI-1723](https://issues.apache.org/jira/browse/HUDI-1723). It is great that the Hudi team is actively working on it. In the meantime, we have come up with the interim solution below:

- We write our data with INSERT because it is both memory- and time-efficient, so switching to UPSERT just to handle missing files will not work for us.
- The solution you proposed, overriding the DFSPathSelector, will work for us. We plan to override the line at https://github.com/apache/hudi/blob/03668dbaf1a60428d7e0d68c6622605e0809150a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DFSPathSelector.java#L165 with `status.getModificationTime() > lastCheckpointTime && status.getModificationTime() <= (System.currentTimeMillis() - 30000)`. This introduces a 30-second lag before records are written, which is acceptable to us and will address the missing-file problem completely. The lag will be configurable via an environment variable. HoodieDeltaStreamer takes `--source-class` as an argument, and we will pass our custom JsonDFSSource, which delegates to our custom DFSPathSelector.
- Can you please validate whether HoodieDeltaStreamer will be able to record the correct checkpoint with the change I am proposing above?
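To make the question concrete, here is a minimal standalone sketch of the check described above. It deliberately avoids Hudi and Hadoop classes and works on plain millisecond timestamps. The class name `LaggedFileFilter`, the environment variable `SOURCE_FILE_LAG_MS`, and the `nextCheckpoint` helper are hypothetical names used only for illustration. The helper assumes that the checkpoint is the largest modification time among the files that pass the filter, which is the behavior I would like confirmed.

```java
// Illustrative sketch only: not Hudi code, all names here are hypothetical.
class LaggedFileFilter {

    // Default lag of 30 seconds, matching the proposal in this comment.
    static final long DEFAULT_LAG_MS = 30_000L;

    // Reads the lag from a hypothetical environment variable,
    // falling back to the default when it is unset or not a number.
    static long lagMillis() {
        String raw = System.getenv("SOURCE_FILE_LAG_MS");
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_LAG_MS;
        }
        try {
            return Long.parseLong(raw.trim());
        } catch (NumberFormatException e) {
            return DEFAULT_LAG_MS;
        }
    }

    // A file is eligible when it is newer than the last checkpoint
    // and at least lagMillis old relative to nowMillis.
    static boolean isEligible(long modificationTime, long lastCheckpointTime,
                              long nowMillis, long lagMillis) {
        return modificationTime > lastCheckpointTime
                && modificationTime <= nowMillis - lagMillis;
    }

    // Assumption: the next checkpoint is the maximum modification time
    // among eligible files, or the previous checkpoint if none qualify.
    static long nextCheckpoint(long[] modificationTimes, long lastCheckpointTime,
                               long nowMillis, long lagMillis) {
        long max = lastCheckpointTime;
        for (long t : modificationTimes) {
            if (isEligible(t, lastCheckpointTime, nowMillis, lagMillis)) {
                max = Math.max(max, t);
            }
        }
        return max;
    }
}
```

Under that assumption, a file modified inside the lag window is skipped in the current run but is still newer than the recorded checkpoint, so it would be picked up in a later run once it is older than the lag.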
