[ 
https://issues.apache.org/jira/browse/BEAM-11494?focusedWorklogId=553891&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-553891
 ]

ASF GitHub Bot logged work on BEAM-11494:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 17/Feb/21 21:18
            Start Date: 17/Feb/21 21:18
    Worklog Time Spent: 10m 
      Work Description: robertwb commented on pull request #13558:
URL: https://github.com/apache/beam/pull/13558#issuecomment-780860713


   That sounds like a good plan to me.
   
   On Wed, Feb 17, 2021 at 1:10 PM Pablo <[email protected]> wrote:
   
   > I don't think HDFS provides retention policies (though imagine, e.g., an S3
   > bucket with a retention policy being accessed via HDFS: it would use
   > HadoopFilesystem and still have a retention policy)
   >
   > I might consider the question differently:
   >
   >    - If the checksum for the file is equal, it is very, very likely that
   >    the file is the same.
   >    - If the filesystem does not provide a checksum for the file, then it
   >    is not possible to know whether it is the same file, even if the sizes
   >    are equal.
   >
   > So what I would propose we do differently instead is:
   >
   >    - If the files have equal checksum, we skip rewriting them
   >    - If the filesystem does not provide a checksum, then we will always
   >    overwrite (instead of matching on file size)
   >
   > This effectively means that for HDFS, we will always overwrite, which is
   > what you propose - but it's not related to whether the FS supports a
   > retention policy, but rather to whether we can be confident that the file
   > contents are the same or not. Thoughts?
   >
   

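The decision rule proposed in the comment above can be sketched as follows. This is a minimal illustration of the skip-vs-overwrite logic, not Beam's actual FileIO implementation; the function name `should_skip_rewrite` and its parameters are hypothetical:

```python
from typing import Optional


def should_skip_rewrite(src_checksum: Optional[str],
                        dst_checksum: Optional[str]) -> bool:
    """Decide whether an already-written destination file can be kept.

    Following the rule proposed in the thread:
      - If the filesystem reports checksums for both files and they match,
        the contents are very likely identical: skip rewriting.
      - If either checksum is unavailable, we cannot be confident the
        contents match (equal sizes are not enough), so always overwrite.
    """
    if src_checksum is not None and dst_checksum is not None:
        return src_checksum == dst_checksum
    # No checksum available (e.g. HDFS): always overwrite.
    return False
```

Under this rule, a filesystem that exposes no checksums (such as HDFS) always takes the overwrite path, which is what makes the behavior depend on checksum support rather than on whether the backing store happens to enforce a retention policy.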



Issue Time Tracking
-------------------

    Worklog Id:     (was: 553891)
    Time Spent: 3h 40m  (was: 3.5h)

> FileIO.Write overwrites destination files on retries
> ----------------------------------------------------
>
>                 Key: BEAM-11494
>                 URL: https://issues.apache.org/jira/browse/BEAM-11494
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-files
>            Reporter: Pablo Estrada
>            Assignee: Pablo Estrada
>            Priority: P2
>          Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> Users have reported cases of FileIO.Write becoming stuck or failing due to 
> overwriting destination files.
> The failures and stuck pipelines occur because some file system buckets have 
> strict retention policies that do not allow existing files to be deleted or 
> overwritten.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
