davidpr91 commented on pull request #15576: URL: https://github.com/apache/beam/pull/15576#issuecomment-930092030
Hi @chamikaramj, @pabloem,

<img width="953" alt="Captura de pantalla 2021-09-24 a las 8 37 20" src="https://user-images.githubusercontent.com/2864462/135258362-206c35e3-74a6-4945-b5fa-0e7d8bb22514.png">

This screenshot shows the logs from an execution under the following conditions:

- This line was commented out: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio.py#L677
- The pipeline ran in Dataflow. The log statements were switched from DEBUG to INFO so that they show up in the Dataflow logs.

As you can see:

- First, it writes the file to the temporary folder (https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio.py#L715).
- Right after that, it tries to delete the same file (https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio.py#L676). Given the small time difference between the write and delete log entries, I think this could be a race condition.
- Since the deletion was commented out, the file is not actually deleted, and afterwards it is moved to the final destination (https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio.py#L638).

If the orphaned file had been deleted, it would never have been moved to the final destination. That's why I propose the workaround in this Pull Request.

Let me know if you need further details.
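To make the ordering described above easier to follow, here is a minimal, self-contained sketch that reproduces it on the local filesystem. It does not use or mirror Beam's actual code: `write_temp_file`, `cleanup_orphans`, `move_to_final` and `run` are hypothetical names, and the write, cleanup, move interleaving is an assumption based on the log timestamps.

```python
import os
import shutil
import tempfile


def write_temp_file(temp_dir, name, payload):
    """Step 1: a writer puts its output in the temporary folder."""
    path = os.path.join(temp_dir, name)
    with open(path, "w") as f:
        f.write(payload)
    return path


def cleanup_orphans(temp_dir, delete):
    """Step 2: cleanup treats every file still in the temp folder as orphaned."""
    orphans = sorted(os.listdir(temp_dir))
    if delete:
        for name in orphans:
            os.remove(os.path.join(temp_dir, name))
    return orphans


def move_to_final(temp_path, final_dir):
    """Step 3: move the temp file to its destination; return False if it is gone."""
    if not os.path.exists(temp_path):
        return False
    shutil.move(temp_path, os.path.join(final_dir, os.path.basename(temp_path)))
    return True


def run(delete_orphans):
    """Run write -> cleanup -> move and list the files that reached final_dir."""
    with tempfile.TemporaryDirectory() as root:
        temp_dir = os.path.join(root, "tmp")
        final_dir = os.path.join(root, "final")
        os.makedirs(temp_dir)
        os.makedirs(final_dir)
        # The cleanup runs between the write and the move, as in the logs.
        temp_path = write_temp_file(temp_dir, "shard-0", "event\n")
        cleanup_orphans(temp_dir, delete=delete_orphans)
        move_to_final(temp_path, final_dir)
        return sorted(os.listdir(final_dir))
```

In this sketch, `run(delete_orphans=True)` leaves the final folder empty because the file is removed before the move, while `run(delete_orphans=False)` returns `['shard-0']`. That is the behaviour the proposed workaround relies on.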
