[
https://issues.apache.org/jira/browse/BEAM-12950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425538#comment-17425538
]
David commented on BEAM-12950:
------------------------------
The issue has been fixed with a temporary workaround.
I have created a separate issue to work on a definitive solution for deleting
orphaned files:
https://issues.apache.org/jira/browse/BEAM-13010
> Missing events when using Python WriteToFiles in streaming pipeline
> -------------------------------------------------------------------
>
> Key: BEAM-12950
> URL: https://issues.apache.org/jira/browse/BEAM-12950
> Project: Beam
> Issue Type: Bug
> Components: io-py-files
> Affects Versions: 2.32.0
> Reporter: David
> Assignee: David
> Priority: P1
> Time Spent: 4h
> Remaining Estimate: 0h
>
> We have a Python streaming pipeline running in Dataflow that consumes events
> from PubSub and writes them to GCS.
> After performing some tests, we realized that we were missing events. The
> reason was that some files were being deleted from the temporary folder
> before they were moved to the destination.
> In the logs we can see that there might be a race condition: the code checks
> whether the file exists in the temp folder before it has actually been
> created, so it is not moved to the destination. Afterwards, the file is
> considered orphaned and is deleted from the temp folder in this line:
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio.py#L677].
> Since files are being moved from the temp folder to the final destination,
> they shouldn’t be deleted in any case; otherwise we would lose events. What
> is the purpose of “orphaned files”? Should they exist at all?
> We know that the Python SDK WriteToFiles is still experimental, but missing
> events looks like a big deal, which is why I've filed this as P1. If you
> think the priority should be lowered, let me know.
> The easiest and safest approach right now would be to not delete any files
> and instead log a message warning that some files might be left orphaned in
> the temporary folder. Eventually, in the next execution, that orphaned file
> will be moved to the destination, so we don’t lose any events. In addition,
> the log level should always be INFO, because Dataflow doesn’t accept the
> DEBUG level.
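
For illustration only, the "do not delete, just log" approach proposed in the description could be sketched roughly as below. This is not the code in apache_beam/io/fileio.py: the function name finalize_temp_files, its parameters, and the use of the local filesystem instead of GCS are all assumptions made to keep the sketch self-contained.

```python
import logging
import os
import shutil

_LOGGER = logging.getLogger(__name__)


def finalize_temp_files(temp_dir, dest_dir, expected_names):
    """Move expected files to dest_dir and never delete leftovers.

    Hypothetical helper for illustration; not part of the Beam SDK.
    Returns a (moved, orphaned) pair of lists of file names.
    """
    os.makedirs(dest_dir, exist_ok=True)
    moved = []
    for name in expected_names:
        src = os.path.join(temp_dir, name)
        # A file that has not been created yet is simply skipped here;
        # it stays in temp_dir once it appears and can be moved later.
        if os.path.exists(src):
            shutil.move(src, os.path.join(dest_dir, name))
            moved.append(name)

    # Anything still in temp_dir is treated as possibly orphaned.
    # It is logged at INFO level and left in place instead of being deleted.
    orphaned = sorted(os.listdir(temp_dir))
    for name in orphaned:
        _LOGGER.info(
            'Leaving possibly orphaned file in temp folder: %s',
            os.path.join(temp_dir, name))
    return moved, orphaned
```

The point of the sketch is that a file missed by the existence check is left behind and reported, so a later run can still move it and no events are lost.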