pabloem commented on pull request #13558: URL: https://github.com/apache/beam/pull/13558#issuecomment-780855078
I don't think HDFS provides retention policies (though imagine, e.g., an S3 bucket with a retention policy being accessed through HadoopFilesystem: it would still have a retention policy). I would frame the question differently:

- If the checksums for the two files are equal, it is very, very likely that the files are the same.
- If the filesystem does not provide a checksum for the file, then it is not possible to know whether it is the same file, even if the sizes are equal.

So what I would propose we do instead is:

- If the files have equal checksums, we skip rewriting them.
- If the filesystem does not provide a checksum, we always overwrite (rather than matching on file size). A rough sketch of this rule is below.

This effectively means that for HDFS we will always overwrite, which is what you propose - but the reason is not whether the filesystem supports a retention policy; it is whether we can be confident that the file contents are the same.

Thoughts?
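For concreteness, here is a minimal sketch of the rule I have in mind. `FileMetadata` and `getChecksum()` are hypothetical placeholders for illustration, not Beam's actual `FileSystems` API:

```java
// A minimal sketch of the proposed rule. FileMetadata and getChecksum() are
// hypothetical placeholders, not Beam's actual FileSystems API.
final class ChecksumOverwriteRule {

  interface FileMetadata {
    /** Returns the filesystem-provided checksum, or null if none is available. */
    String getChecksum();
  }

  /** Decides whether the destination file should be rewritten. */
  static boolean shouldOverwrite(FileMetadata source, FileMetadata destination) {
    String srcChecksum = source.getChecksum();
    String dstChecksum = destination.getChecksum();
    // No checksum available (e.g. HDFS): always overwrite; do not fall back
    // to a size comparison, since equal sizes do not imply equal contents.
    if (srcChecksum == null || dstChecksum == null) {
      return true;
    }
    // Equal checksums: the contents are almost certainly identical, so skip the rewrite.
    return !srcChecksum.equals(dstChecksum);
  }
}
```

The important design point is that the fallback when no checksum is available is to overwrite, never to compare sizes.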
