[GitHub] [beam] damccorm opened a new issue, #21082: FileIO can produce duplicates in output files

GitBox Sat, 04 Jun 2022 14:03:50 -0700


damccorm opened a new issue, #21082:
URL: https://github.com/apache/beam/issues/21082


   FileIO can produce duplicates in output files, depending on the runner.
   
   Concrete example for Spark when executing as batch:
   
   When FileIO is used with a specific number of shards, it uses the default 
sharding function, which is a round-robin shard assignment with a random seed. 
In a multistage pipeline, data between stages is held by the shuffle service 
until the downstream stage requests it for further computation. If shuffle 
results computed with this seeded sharding function are lost (e.g. the shuffle 
service fails because of a HW error), Spark will attempt to recover the data by 
recomputing it from the source data. Because the shard assignment is randomly 
seeded, the recomputation can assign a different shard, and therefore a 
different key, to the same element. Elements already written under the 
original assignment can then be written again under the new one.
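
The failure mode described above can be illustrated with a small simulation. This is only a sketch in plain Python, not Beam or Spark code: `round_robin`, `hash_shard`, and `written_output` are hypothetical helpers, the `start` argument stands in for the randomly seeded starting shard, and the scenario (shard 0 already written before the shuffle data is lost) is an assumption chosen for illustration.

```python
import zlib


def round_robin(elements, num_shards, start):
    """Assign shards round-robin, beginning at shard `start`.

    `start` stands in for the randomly seeded starting point (an assumption
    for this sketch); a recomputation would draw a different value.
    """
    return {e: (start + i) % num_shards for i, e in enumerate(elements)}


def hash_shard(elements, num_shards):
    """Assign shards from a stable hash of the element itself."""
    return {e: zlib.crc32(e.encode("utf-8")) % num_shards for e in elements}


def written_output(first, retry, num_shards, already_written):
    """Collect what ends up in the output files.

    Shards listed in `already_written` were produced from the first attempt;
    all other shards are produced from the recomputed assignment.
    """
    out = []
    for shard in range(num_shards):
        source = first if shard in already_written else retry
        out.extend(sorted(e for e, s in source.items() if s == shard))
    return out


elements = ["a", "b", "c", "d", "e", "f"]

# Seeded round-robin: the retry starts at a different shard than the first attempt.
first = round_robin(elements, 3, start=0)
retry = round_robin(elements, 3, start=1)
seeded_output = written_output(first, retry, 3, already_written={0})

# Stable sharding: both attempts produce the same assignment.
stable = hash_shard(elements, 3)
stable_output = written_output(stable, hash_shard(elements, 3), 3, already_written={0})

print(seeded_output)          # "a" and "d" appear twice; "c" and "f" are missing
print(sorted(stable_output))  # every element appears exactly once
```

In this sketch the seeded assignment both duplicates and drops elements once the two attempts are mixed, while an assignment derived only from the element's content is unaffected by recomputation.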
   
   More details are discussed in this thread:
   
https://lists.apache.org/thread.html/r5e91d1996479defbf5e896dca3cf237ee2d9b59396cb3c4edf619df1%40%3Cdev.beam.apache.org%3E
   
   Imported from Jira 
[BEAM-12654](https://issues.apache.org/jira/browse/BEAM-12654). Original Jira 
may contain additional context.
   Reported by: jvilcek.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
