[
https://issues.apache.org/jira/browse/BEAM-5036?focusedWorklogId=145644&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-145644
]
ASF GitHub Bot logged work on BEAM-5036:
----------------------------------------
Author: ASF GitHub Bot
Created on: 19/Sep/18 12:39
Start Date: 19/Sep/18 12:39
Worklog Time Spent: 10m
Work Description: timrobertson100 commented on issue #6289: [BEAM-5036]
Optimize the FileBasedSink WriteOperation.moveToOutput()
URL: https://github.com/apache/beam/pull/6289#issuecomment-422786830
@reuvenlax @iemejia @jbonofre - I am a bit unsure what to do here and would
appreciate your thoughts. Note that this is all about large performance
improvements for IOs that write to HDFS.
The Beam `FileSystems.rename()` is under-documented and behaves differently
depending on the underlying filesystem. For example, HDFS will fail if the
destination file exists, while `LocalFileSystem` uses
`StandardCopyOption.REPLACE_EXISTING` and always overwrites.
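To illustrate the two behaviours with plain `java.nio.file` calls, here is a small sketch. It only shows the semantics on the local default filesystem and is not Beam or HDFS code; the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class RenameSemantics {

  // Creates a temp directory holding src.txt ("new") and dst.txt ("old").
  static Path[] setUp() throws IOException {
    Path dir = Files.createTempDirectory("rename-demo");
    Path src = Files.write(dir.resolve("src.txt"), "new".getBytes(StandardCharsets.UTF_8));
    Path dst = Files.write(dir.resolve("dst.txt"), "old".getBytes(StandardCharsets.UTF_8));
    return new Path[] {src, dst};
  }

  // "Fail if exists" style: a move without options refuses an existing target.
  public static boolean strictMoveRejectsExisting() throws IOException {
    Path[] p = setUp();
    try {
      Files.move(p[0], p[1]);
      return false;
    } catch (FileAlreadyExistsException e) {
      return true;
    }
  }

  // "Always overwrite" style: REPLACE_EXISTING silently replaces the target.
  public static String replacingMoveResult() throws IOException {
    Path[] p = setUp();
    Files.move(p[0], p[1], StandardCopyOption.REPLACE_EXISTING);
    return new String(Files.readAllBytes(p[1]), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    System.out.println("strict move rejected: " + strictMoveRejectsExisting());
    System.out.println("content after replacing move: " + replacingMoveResult());
  }
}
```

The same call therefore either fails or overwrites depending only on which option the filesystem implementation chooses to pass, which is the inconsistency described above.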
In this PR I opted to add a `StandardMoveOptions.OVERWRITE_EXISTING_FILES`
flag: if the underlying FS throws a `FileAlreadyExistsException`, the file is
overwritten only when that flag is set. This works nicely for HDFS. However,
the logic is at the mercy of the underlying FS, as many won't surface an error
and will simply overwrite. Thus a user who does not pass
`StandardMoveOptions.OVERWRITE_EXISTING_FILES` may still see existing files
overwritten, which would be surprising.
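A minimal sketch of that flag-guarded logic, written against `java.nio.file` rather than Beam's `FileSystem` API. The `GuardedMove` class and its `MoveFlag` enum are hypothetical stand-ins, not the code in the PR.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

public class GuardedMove {

  // Hypothetical stand-in for the proposed flag; this is not Beam's enum.
  public enum MoveFlag { OVERWRITE_EXISTING_FILES }

  // Tries a strict move first and overwrites only if the caller opted in.
  public static void move(Path src, Path dst, MoveFlag... flags) throws IOException {
    boolean overwrite = Arrays.asList(flags).contains(MoveFlag.OVERWRITE_EXISTING_FILES);
    try {
      Files.move(src, dst);
    } catch (FileAlreadyExistsException e) {
      if (!overwrite) {
        throw e; // the caller did not ask for overwriting, so surface the error
      }
      Files.move(src, dst, StandardCopyOption.REPLACE_EXISTING);
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("guarded-move-demo");
    Path src = Files.write(dir.resolve("src.txt"), "new".getBytes());
    Path dst = Files.write(dir.resolve("dst.txt"), "old".getBytes());
    move(src, dst, MoveFlag.OVERWRITE_EXISTING_FILES);
    System.out.println(new String(Files.readAllBytes(dst)));
  }
}
```

The guard only helps when the first move actually raises `FileAlreadyExistsException`. A filesystem that overwrites silently never reaches the `catch` block, so the flag has no effect there.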
I think we can do one of the following:
1. Be explicit and make all FS respect a control flag to enable overwriting
2. Silently overwrite always
3. Let people provide the flag, knowing some FS will ignore it
Note that `FileSystem.rename()` was _never_ used in the Beam codebase before
this PR, but it is a public method, so we might affect other users.
Issue Time Tracking
-------------------
Worklog Id: (was: 145644)
Time Spent: 2h 10m (was: 2h)
> Optimize FileBasedSink's WriteOperation.moveToOutput()
> ------------------------------------------------------
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
> Issue Type: Improvement
> Components: io-java-files
> Affects Versions: 2.5.0
> Reporter: Jozef Vilcek
> Assignee: Tim Robertson
> Priority: Major
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> The moveToOutput() method in FileBasedSink.WriteOperation implements move by
> copy+delete. It would be better to use rename(), which can be much more
> efficient on some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to
> this for the HDFS filesystem.
> Feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E