abhijeetkushe edited a comment on issue #2850:
URL: https://github.com/apache/hudi/issues/2850#issuecomment-824917902


   @xushiyan Thanks for your prompt reply. I agree that the issue I am facing is 
somewhat related to 
[HUDI-1723](https://issues.apache.org/jira/browse/HUDI-1723).
   It is great that the Hudi team is actively working on this issue. In the 
meantime, we have come up with the following interim solution:
   
   - We write our data with INSERT because it is both memory- and 
time-efficient, so switching to UPSERT just to handle missing files will not 
work for us.
   - The solution you proposed, overriding the DFSPathSelector, will work for 
us. We are using Hudi version 0.6.0 and plan to replace [this 
line](https://github.com/apache/hudi/blob/release-0.6.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DFSPathSelector.java#L92)
   with `f.getModificationTime() <= Long.valueOf(lastCheckpointStr.get()).longValue() || f.getModificationTime() > (System.currentTimeMillis() - 30000)`.
   This introduces a 30-second lag before records are written, which is 
acceptable to us, and will address the missing-file problem completely. The 
30-second lag will be configurable via an environment variable. 
HoodieDeltaStreamer takes `--source-class` as an argument, where we will 
provide our custom JsonDFSSource, which delegates to our custom 
DFSPathSelector.
   - Can you please confirm whether HoodieDeltaStreamer will still record the 
correct checkpoint with the change proposed above?
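
For reference, here is a minimal sketch of the check we have in mind, written as a standalone helper rather than against the actual DFSPathSelector code. The class name `LaggedFileFilter`, its method names, and the environment variable `SOURCE_FILE_LAG_MS` are placeholders of ours, not Hudi APIs. The `isPresent()` guard for a first run with no checkpoint is also an assumption on our side.

```java
import java.util.Optional;

// Hypothetical helper, not part of Hudi: isolates the skip condition we
// intend to use inside our custom DFSPathSelector.
final class LaggedFileFilter {

    // Placeholder environment variable name (assumption, not a Hudi setting).
    static final String LAG_ENV_VAR = "SOURCE_FILE_LAG_MS";
    static final long DEFAULT_LAG_MS = 30_000L;

    // Reads the lag from the environment, falling back to 30 seconds when
    // the variable is unset or not a valid number.
    static long lagMillisFromEnv() {
        String raw = System.getenv(LAG_ENV_VAR);
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_LAG_MS;
        }
        try {
            return Math.max(0L, Long.parseLong(raw.trim()));
        } catch (NumberFormatException e) {
            return DEFAULT_LAG_MS;
        }
    }

    // Returns true when a file should be skipped in this round: either it
    // was modified at or before the last checkpoint, or it was modified
    // within the last lagMillis and may still be incomplete.
    static boolean shouldSkip(long modificationTime,
                              Optional<String> lastCheckpointStr,
                              long nowMillis,
                              long lagMillis) {
        boolean alreadyIngested = lastCheckpointStr.isPresent()
                && modificationTime <= Long.parseLong(lastCheckpointStr.get());
        boolean tooRecent = modificationTime > nowMillis - lagMillis;
        return alreadyIngested || tooRecent;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long lag = lagMillisFromEnv();
        Optional<String> checkpoint = Optional.of(String.valueOf(now - 120_000L));

        // A file modified one minute ago is past the checkpoint and outside
        // the default lag window; a file modified just now is not.
        System.out.println("one minute old: skip=" + shouldSkip(now - 60_000L, checkpoint, now, lag));
        System.out.println("just written:   skip=" + shouldSkip(now, checkpoint, now, lag));
    }
}
```

In the selector itself, `modificationTime` would come from `f.getModificationTime()` and `nowMillis` from `System.currentTimeMillis()`. Passing the clock value in as a parameter is only there to keep the check easy to unit test.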
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

