[ 
https://issues.apache.org/jira/browse/BEAM-12950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anonymous updated BEAM-12950:
-----------------------------
    Status: Triage Needed  (was: Resolved)

> Missing events when using Python WriteToFiles in streaming pipeline
> -------------------------------------------------------------------
>
>                 Key: BEAM-12950
>                 URL: https://issues.apache.org/jira/browse/BEAM-12950
>             Project: Beam
>          Issue Type: Bug
>          Components: io-py-files
>    Affects Versions: 2.32.0
>            Reporter: David
>            Assignee: David
>            Priority: P1
>             Fix For: 2.34.0
>
>          Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> We have a Python streaming pipeline consuming events from PubSub and writing 
> them into GCS, in Dataflow.
> After performing some tests, we realized that we were missing events. The 
> reason was that some files were being deleted from the temporary folder 
> before moving them to the destination.
> In the logs we can see that there might be a race condition: the code checks 
> if the file exists in temp folder before it has been actually created, Thus 
> it’s not moved to the destination. Afterwards, the file is considered 
> orphaned and is deleted from the temp folder in this line: 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio.py#L677].
> Since files are being moved from the temp folder to the final destination, 
> they shouldn’t be deleted in any case, otherwise we would lose events. For 
> what purpose we have “orphaned files”? Should they exist?
> We know that Python SDK WriteToFiles is still experimental, but missing 
> events looks like a big deal and that's why I've created it as P1. If you 
> think it should be lowered let me know.
> The easiest and safest approach right now would be to not delete any files 
> and log a message to warn that some files might be left orphaned in the 
> temporary folder. Eventually, in the next execution, that orphaned file will 
> be moved to the destination, so we don’t lose any event. In addition, the log 
> level should be always INFO, because Dataflow doesn’t accept DEBUG.level. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to