@reuvenlax @iemejia @jbonofre - I am a bit unsure what to do here and would appreciate your thoughts. Note that this is all about big performance improvements for IOs that write to HDFS.
The Beam `FileSystems.rename()` is under-documented and behaves differently depending on the underlying filesystem. For example, HDFS will fail if the destination file exists, while `LocalFileSystem` uses `StandardCopyOption.REPLACE_EXISTING` and always overwrites.

In this PR I opted to add a `StandardMoveOptions.OVERWRITE_EXISTING_FILES` option: if the underlying FS throws a `FileAlreadyExistsException`, the file is overwritten only if that flag is enabled. This works nicely for HDFS. However, this logic is at the mercy of the underlying FS, as many won't surface an error and will simply overwrite. Thus, if a user does not include `StandardMoveOptions.OVERWRITE_EXISTING_FILES`, they may see surprising results.

I think we can do one of the following:

1. Be explicit and make every FS respect a control flag to enable overwriting
2. Silently overwrite always
3. Let people provide the flag, knowing some FS will ignore it

Note that `FileSystem.rename()` is _never_ used in the Beam codebase until this PR, but it is a public method, so we might affect others.

[ Full content available at: https://github.com/apache/beam/pull/6289 ]
This message was relayed via gitbox.apache.org
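
For illustration only, the flag-gated overwrite discussed in this comment can be sketched with plain `java.nio.file` instead of Beam's own classes. `RenameSketch` and its `overwriteExisting` parameter are hypothetical stand-ins for the proposed `StandardMoveOptions.OVERWRITE_EXISTING_FILES`; this is not the code in the PR, and it assumes a filesystem that does report an existing destination.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper, not part of Beam: rename that overwrites only when asked to. */
final class RenameSketch {
  private RenameSketch() {}

  /**
   * Moves src to dst. If dst already exists, the move is retried with
   * REPLACE_EXISTING only when overwriteExisting is true; otherwise the
   * FileAlreadyExistsException is propagated to the caller.
   */
  static void rename(Path src, Path dst, boolean overwriteExisting) throws IOException {
    try {
      // Without options, Files.move fails if the destination exists.
      Files.move(src, dst);
    } catch (FileAlreadyExistsException e) {
      if (!overwriteExisting) {
        throw e; // flag not set: surface the error instead of overwriting
      }
      // Flag set: explicitly replace the existing destination.
      Files.move(src, dst, StandardCopyOption.REPLACE_EXISTING);
    }
  }
}
```

This only behaves as intended when the first move actually fails on an existing destination. A filesystem that silently overwrites never reaches the `catch` branch, which is the concern behind option 3.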
