[ https://issues.apache.org/jira/browse/BEAM-12950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Anonymous updated BEAM-12950:
-----------------------------
    Status: Triage Needed  (was: Resolved)

> Missing events when using Python WriteToFiles in streaming pipeline
> -------------------------------------------------------------------
>
>                 Key: BEAM-12950
>                 URL: https://issues.apache.org/jira/browse/BEAM-12950
>             Project: Beam
>          Issue Type: Bug
>          Components: io-py-files
>    Affects Versions: 2.32.0
>            Reporter: David
>            Assignee: David
>            Priority: P1
>             Fix For: 2.34.0
>
>          Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> We have a Python streaming pipeline in Dataflow consuming events from PubSub
> and writing them into GCS.
> After running some tests, we realized that we were missing events. The reason
> was that some files were being deleted from the temporary folder before being
> moved to their destination.
> The logs suggest a race condition: the code checks whether the file exists in
> the temp folder before it has actually been created, so the file is never
> moved to the destination. The file is then considered orphaned and is deleted
> from the temp folder in this line:
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio.py#L677].
> Since files are moved from the temp folder to the final destination, they
> should never be deleted there; otherwise we lose events. What purpose do
> "orphaned files" serve? Should they exist at all?
> We know that the Python SDK's WriteToFiles is still experimental, but losing
> events looks like a big deal, which is why I filed this as P1. If you think
> it should be lowered, let me know.
> The easiest and safest approach right now would be to not delete any files
> and instead log a message warning that some files might be left orphaned in
> the temporary folder. In a subsequent execution, an orphaned file will
> eventually be moved to the destination, so no events are lost. In addition,
> the log level should always be INFO, because Dataflow doesn't surface
> DEBUG-level logs.
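A minimal sketch of the proposed behavior change, outside the actual fileio.py code: the function name `handle_orphaned_temp_files` is hypothetical and the snippet only illustrates logging orphans at INFO level instead of deleting them, so a later run can still finalize them.

```python
import logging

logging.basicConfig(level=logging.INFO)
_LOGGER = logging.getLogger(__name__)


def handle_orphaned_temp_files(orphaned_paths):
    """Hypothetical replacement for the deletion at fileio.py#L677.

    Instead of removing files that look orphaned (which can drop events
    if the "orphan" merely lost the existence-check race), log them at
    INFO (Dataflow does not surface DEBUG) and leave them in place so a
    subsequent execution can move them to the final destination.
    """
    for path in orphaned_paths:
        _LOGGER.info(
            'Leaving possibly orphaned file in the temporary folder: %s',
            path)
    # Return the paths untouched; nothing is deleted.
    return list(orphaned_paths)
```

Under this sketch, the worst case is stale files accumulating in the temp folder, which is recoverable, whereas deletion of a mislabeled orphan loses events permanently.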
-- This message was sent by Atlassian Jira (v8.20.10#820010)