[ 
https://issues.apache.org/jira/browse/BEAM-5036?focusedWorklogId=145644&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-145644
 ]

ASF GitHub Bot logged work on BEAM-5036:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 19/Sep/18 12:39
            Start Date: 19/Sep/18 12:39
    Worklog Time Spent: 10m 
      Work Description: timrobertson100 commented on issue #6289: [BEAM-5036] 
Optimize the FileBasedSink WriteOperation.moveToOutput()
URL: https://github.com/apache/beam/pull/6289#issuecomment-422786830
 
 
   @reuvenlax @iemejia @jbonofre - I am unsure how to proceed here and would 
appreciate your thoughts. Note that this is all about large performance 
improvements for IOs that write to HDFS.
   
   The Beam `FileSystems.rename()` is underdocumented and behaves differently 
depending on the underlying filesystem. For example, HDFS will fail if the 
target file exists, while `LocalFileSystem` uses 
`StandardCopyOption.REPLACE_EXISTING` and always overwrites.
   
   In this PR I opted to add a `StandardMoveOptions.OVERWRITE_EXISTING_FILES` 
option: if the underlying FS throws a `FileAlreadyExistsException`, the target 
is overwritten only if that flag is enabled. This works nicely for HDFS. 
However, this logic is at the mercy of the underlying FS, as many won't surface 
an error and will simply overwrite. Thus a user who omits 
`StandardMoveOptions.OVERWRITE_EXISTING_FILES` may still see existing files 
overwritten, which would be surprising.
   
   I think we can do one of the following:
   1. Be explicit and make all FS respect a control flag to enable overwriting
   2. Silently overwrite always 
   3. Let people provide the flag, knowing some FS will ignore it
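
   To make option 1 concrete, here is a minimal sketch using only `java.nio` 
rather than any Beam class. `OverwriteFlagSketch`, its `MoveOption` enum and 
the `rename()` helper are hypothetical names for illustration, not Beam API. 
The `REPLACE_EXISTING` move stands in for a filesystem whose native rename 
silently overwrites, and the explicit existence check is what makes such a 
filesystem respect the flag:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

public class OverwriteFlagSketch {

  /** Hypothetical stand-in for the flag proposed in this PR; not Beam API. */
  enum MoveOption { OVERWRITE_EXISTING_FILES }

  /**
   * Hypothetical helper: renames src to dst and refuses to replace an
   * existing dst unless OVERWRITE_EXISTING_FILES is passed.
   */
  static void rename(Path src, Path dst, MoveOption... options) throws IOException {
    boolean overwrite =
        Arrays.asList(options).contains(MoveOption.OVERWRITE_EXISTING_FILES);
    // Explicit pre-check, because the underlying move below never reports a
    // conflict. Check-then-move is not atomic, so a concurrent writer could
    // still create dst between the two calls.
    if (!overwrite && Files.exists(dst)) {
      throw new FileAlreadyExistsException(dst.toString());
    }
    // REPLACE_EXISTING simulates a filesystem that silently overwrites.
    Files.move(src, dst, StandardCopyOption.REPLACE_EXISTING);
  }

  /** Runs both cases in a temp directory and returns a short summary. */
  static String demo() {
    try {
      Path dir = Files.createTempDirectory("rename-sketch");
      Path src = dir.resolve("temp-shard");
      Path dst = dir.resolve("final-shard");
      Files.write(src, "new".getBytes(StandardCharsets.UTF_8));
      Files.write(dst, "old".getBytes(StandardCharsets.UTF_8));

      String first;
      try {
        rename(src, dst); // no flag: must not touch the existing file
        first = "overwritten";
      } catch (FileAlreadyExistsException e) {
        first = "refused";
      }
      rename(src, dst, MoveOption.OVERWRITE_EXISTING_FILES);
      String content = new String(Files.readAllBytes(dst), StandardCharsets.UTF_8);
      return first + "," + content;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

   Without the flag the first call is refused and the existing file is left 
alone; with the flag the target ends up holding the new content. The cost of 
this approach is an extra existence check per file, plus the race noted in the 
comment, on filesystems that cannot report the conflict themselves.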
   
   Note that `FileSystem.rename()` was _never_ used in the Beam codebase before 
this PR, but it is a public method, so changing its behavior might affect 
others.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 145644)
    Time Spent: 2h 10m  (was: 2h)

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> ------------------------------------------------------
>
>                 Key: BEAM-5036
>                 URL: https://issues.apache.org/jira/browse/BEAM-5036
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-files
>    Affects Versions: 2.5.0
>            Reporter: Jozef Vilcek
>            Assignee: Tim Robertson
>            Priority: Major
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The moveToOutput() method in FileBasedSink.WriteOperation implements the move 
> as copy+delete. It would be better to use rename(), which can be much more 
> efficient on some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to 
> this for the case of the HDFS filesystem.
> The feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
