David created BEAM-12950:
----------------------------

             Summary: Missing events when using Python WriteToFiles in 
streaming pipeline
                 Key: BEAM-12950
                 URL: https://issues.apache.org/jira/browse/BEAM-12950
             Project: Beam
          Issue Type: Bug
          Components: io-py-files
    Affects Versions: 2.32.0
            Reporter: David


We have a Python streaming pipeline running on Dataflow that consumes events 
from PubSub and writes them into GCS.

After performing some tests, we realized that we were missing events. The 
reason was that some files were being deleted from the temporary folder before 
they were moved to the final destination.

In the logs we can see that there might be a race condition: the code checks 
whether the file exists in the temp folder before it has actually been created, 
so it is not moved to the destination. Afterwards, the file is considered 
orphaned and is deleted from the temp folder at this line: 
[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio.py#L677].
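A minimal sketch of the check-then-act pattern we suspect (illustrative only; 
the function name and simplified flow below are my assumptions, not the actual 
fileio.py code):

{code:python}
# Illustrative sketch of the suspected race, not the real Beam internals.
from apache_beam.io.filesystems import FileSystems


def move_temp_file_to_destination(temp_path, final_path):
  # The existence check can run before the writer has finished creating
  # the temp file...
  if FileSystems.exists(temp_path):
    FileSystems.rename([temp_path], [final_path])
  # ...so the file is silently skipped here. Later it appears in the temp
  # folder, is classified as "orphaned" and deleted, and its events are lost.
{code}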

Since files are moved from the temp folder to the final destination, they 
should never be deleted; otherwise we lose events. What purpose do "orphaned 
files" serve? Should they exist at all?

We know that the Python SDK's WriteToFiles is still experimental, but losing 
events seems like a big deal, which is why I've filed this as P1. If you think 
it should be lowered, let me know.

The easiest and safest approach right now would be to not delete any files and 
to log a message warning that some files might be left orphaned in the 
temporary folder. Eventually, in the next execution, those orphaned files would 
be moved to the destination, so we wouldn't lose any events. In addition, the 
log level should always be INFO, because Dataflow doesn't accept the DEBUG 
level.
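
As a rough sketch of what that could look like (hypothetical; the helper name 
and variables below are mine, not the actual fileio.py internals):

{code:python}
# Hypothetical sketch of the proposed behaviour, not the real Beam code.
import logging

_LOGGER = logging.getLogger(__name__)


def handle_orphaned_temp_files(orphaned_files):
  # Instead of deleting "orphaned" temp files (which loses events when the
  # orphan is really a file the race made us skip), keep them and only warn.
  # A later execution can still move them to the final destination.
  if orphaned_files:
    _LOGGER.info(
        'Some files may have been left orphaned in the temporary folder: %s',
        orphaned_files)
    # Deliberately no FileSystems.delete(orphaned_files) call here.
{code}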



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
