[
https://issues.apache.org/jira/browse/BEAM-12950?focusedWorklogId=657171&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-657171
]
ASF GitHub Bot logged work on BEAM-12950:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 29/Sep/21 11:32
Start Date: 29/Sep/21 11:32
Worklog Time Spent: 10m
Work Description: davidpr91 commented on pull request #15576:
URL: https://github.com/apache/beam/pull/15576#issuecomment-930092030
Hi @chamikaramj, @pabloem,
<img width="953" alt="Captura de pantalla 2021-09-24 a las 8 37 20"
src="https://user-images.githubusercontent.com/2864462/135258362-206c35e3-74a6-4945-b5fa-0e7d8bb22514.png">
In this screenshot you can see the logs of what happened during an execution
with the following conditions:
- This line was commented:
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio.py#L677
- Running in Dataflow. Logs were switched from DEBUG to INFO to be able to
see them in Dataflow logs.
As you can see:
- first it writes the file in the temporary folder
(https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio.py#L715).
- Right after it tried to delete the same file
(https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio.py#L676).
Taking into account the small time difference between the write and delete
logs, I think it could be a race condition.
- Since we had the deletion commented, it's not really deleted and
afterwards, it's moved to the final destination:
(https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio.py#L638)
If the orphaned file had been deleted, it would never have been moved to the
final destination. That's why I propose the workaround in this Pull Request.
Let me know if you need further details.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 657171)
Time Spent: 1.5h (was: 1h 20m)
> Missing events when using Python WriteToFiles in streaming pipeline
> -------------------------------------------------------------------
>
> Key: BEAM-12950
> URL: https://issues.apache.org/jira/browse/BEAM-12950
> Project: Beam
> Issue Type: Bug
> Components: io-py-files
> Affects Versions: 2.32.0
> Reporter: David
> Priority: P1
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> We have a Python streaming pipeline consuming events from PubSub and writing
> them into GCS, in Dataflow.
> After performing some tests, we realized that we were missing events. The
> reason was that some files were being deleted from the temporary folder
> before moving them to the destination.
> In the logs we can see that there might be a race condition: the code checks
> if the file exists in temp folder before it has been actually created, Thus
> it’s not moved to the destination. Afterwards, the file is considered
> orphaned and is deleted from the temp folder in this line:
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio.py#L677].
> Since files are being moved from the temp folder to the final destination,
> they shouldn’t be deleted in any case, otherwise we would lose events. For
> what purpose we have “orphaned files”? Should they exist?
> We know that Python SDK WriteToFiles is still experimental, but missing
> events looks like a big deal and that's why I've created it as P1. If you
> think it should be lowered let me know.
> The easiest and safest approach right now would be to not delete any files
> and log a message to warn that some files might be left orphaned in the
> temporary folder. Eventually, in the next execution, that orphaned file will
> be moved to the destination, so we don’t lose any event. In addition, the log
> level should be always INFO, because Dataflow doesn’t accept DEBUG.level.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)