davidpr91 commented on pull request #15576:
URL: https://github.com/apache/beam/pull/15576#issuecomment-930092030


   Hi @chamikaramj, @pabloem,
   
   <img width="953" alt="Screenshot 2021-09-24 at 8 37 20" 
src="https://user-images.githubusercontent.com/2864462/135258362-206c35e3-74a6-4945-b5fa-0e7d8bb22514.png">
    
   In this screenshot you can see the logs from an execution with the 
following conditions:
   - This line was commented out: 
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio.py#L677
   - The pipeline ran in Dataflow. The log statements were switched from DEBUG 
to INFO so that they show up in the Dataflow logs.
   
   As you can see:
   - First, it writes the file to the temporary folder 
(https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio.py#L715).
   - Right after, it tries to delete the same file 
(https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio.py#L676).
 Given the small time difference between the write and delete logs, I think 
it could be a race condition.
   - Since the deletion was commented out, the file is not actually deleted 
and is afterwards moved to the final destination 
(https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio.py#L638).
   
   If the orphaned file had been deleted, it would never have been moved to the 
final destination. That's why I propose the workaround in this Pull Request.
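
   To make the sequence above concrete, here is a minimal sketch that uses 
only the Python standard library. It is not the code in `fileio.py`: the 
function names (`write_temp_file`, `remove_orphans`, `move_to_final`, `run`) 
are hypothetical, and the interleaving is forced deterministically instead of 
depending on timing. It assumes the cleanup step treats every temp file it has 
not been told about as orphaned.

```python
import os
import shutil
import tempfile


def write_temp_file(temp_dir, name, data):
    """Write a file into the temporary folder and return its path."""
    path = os.path.join(temp_dir, name)
    with open(path, "w") as f:
        f.write(data)
    return path


def remove_orphans(temp_dir, known_files):
    """Hypothetical cleanup: delete every temp file not listed in known_files."""
    removed = []
    for name in os.listdir(temp_dir):
        path = os.path.join(temp_dir, name)
        if path not in known_files:
            os.remove(path)
            removed.append(path)
    return removed


def move_to_final(temp_path, final_dir):
    """Move a temp file to the final destination; return True on success."""
    if not os.path.exists(temp_path):
        return False
    shutil.move(temp_path, os.path.join(final_dir, os.path.basename(temp_path)))
    return True


def run(delete_orphans):
    """Simulate write -> (optional) orphan cleanup -> move to final folder."""
    base = tempfile.mkdtemp()
    temp_dir = os.path.join(base, "temp")
    final_dir = os.path.join(base, "final")
    os.makedirs(temp_dir)
    os.makedirs(final_dir)
    try:
        temp_path = write_temp_file(temp_dir, "shard-0", "data")
        # Assumption: the cleanup runs before the freshly written file is
        # known to it, so the file looks orphaned and gets deleted.
        if delete_orphans:
            remove_orphans(temp_dir, known_files=set())
        return move_to_final(temp_path, final_dir)
    finally:
        shutil.rmtree(base)
```

   Under these assumptions, `run(delete_orphans=False)` moves the file to the 
final folder, while `run(delete_orphans=True)` finds nothing left to move.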
   
   Let me know if you need further details.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


