David created BEAM-12950:
----------------------------
Summary: Missing events when using Python WriteToFiles in
streaming pipeline
Key: BEAM-12950
URL: https://issues.apache.org/jira/browse/BEAM-12950
Project: Beam
Issue Type: Bug
Components: io-py-files
Affects Versions: 2.32.0
Reporter: David
We have a Python streaming pipeline running on Dataflow that consumes events from
Pub/Sub and writes them to GCS.
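For context, here is a minimal sketch of the shape of the pipeline (the bucket,
subscription and window size are placeholders, not our real configuration):

{code:python}
import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

# Illustrative only: a streaming pipeline reading from Pub/Sub and writing
# windowed text files to GCS with fileio.WriteToFiles.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(
         subscription='projects/my-project/subscriptions/my-sub')
     | 'Decode' >> beam.Map(lambda msg: msg.decode('utf-8'))
     | 'Window' >> beam.WindowInto(window.FixedWindows(60))
     | 'WriteToGCS' >> fileio.WriteToFiles(
         path='gs://my-bucket/output/',
         sink=lambda dest: fileio.TextSink()))
{code}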
After running some tests, we realized that we were missing events. The reason was
that some files were being deleted from the temporary folder before being moved to
the destination.
In the logs we can see that there might be a race condition: the code checks whether
the file exists in the temp folder before it has actually been created, so the file
is not moved to the destination. Afterwards, the file is considered orphaned and is
deleted from the temp folder in this line:
[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio.py#L677].
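To make the suspected sequence concrete, here is a simplified illustration of the
check-then-act pattern that we think is racing. The function and parameter names
are hypothetical; this is not the actual code around the linked line:

{code:python}
from apache_beam.io.filesystems import FileSystems

def finalize_destination(known_temp_files, final_paths, temp_dir_pattern):
    # Step 1: move the temp files we know about to the final destination.
    for temp_path, final_path in zip(known_temp_files, final_paths):
        if FileSystems.exists(temp_path):                  # check...
            FileSystems.rename([temp_path], [final_path])  # ...then act
        # If the writer has not actually created temp_path yet, the move is
        # silently skipped and that file stays behind in the temp folder.

    # Step 2: anything still matching the temp pattern is treated as
    # "orphaned" and deleted, including the file skipped above, so its
    # events never reach the destination.
    leftovers = [metadata.path
                 for result in FileSystems.match([temp_dir_pattern])
                 for metadata in result.metadata_list]
    if leftovers:
        FileSystems.delete(leftovers)
{code}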
Since files are moved from the temp folder to the final destination, they shouldn't
be deleted in any case; otherwise we lose events. What purpose do “orphaned files”
serve? Should they exist at all?
We know that the Python SDK WriteToFiles is still experimental, but losing events
looks like a big deal, which is why I've filed this as a P1. If you think it should
be lowered, let me know.
The easiest and safest approach right now would be to not delete any files and to
log a message warning that some files might be left orphaned in the temporary
folder. Eventually, in a later execution, the orphaned file will be moved to the
destination, so we don't lose any events. In addition, the log level should always
be INFO, because Dataflow doesn't accept the DEBUG level.
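A rough sketch of the behaviour we are proposing (illustrative only, with a
hypothetical helper name; the real change would go where the deletion currently
happens in fileio.py):

{code:python}
import logging

from apache_beam.io.filesystems import FileSystems

_LOGGER = logging.getLogger(__name__)

def report_orphaned_files(temp_dir_pattern):
    # Instead of deleting leftover temp files, only report them; a later
    # execution can still move them to the final destination, so no events
    # are lost.
    leftovers = [metadata.path
                 for result in FileSystems.match([temp_dir_pattern])
                 for metadata in result.metadata_list]
    if leftovers:
        # INFO rather than DEBUG so the message shows up in Dataflow logs.
        _LOGGER.info(
            'Some files may have been left orphaned in the temporary '
            'folder: %s', leftovers)
{code}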
--
This message was sent by Atlassian Jira
(v8.3.4#803005)