Jozef Vilcek created BEAM-12654:
-----------------------------------

             Summary: FileIO can produce duplicates in output files
                 Key: BEAM-12654
                 URL: https://issues.apache.org/jira/browse/BEAM-12654
             Project: Beam
          Issue Type: Bug
          Components: sdk-java-core
            Reporter: Jozef Vilcek


FileIO can produce duplicates in output files, depending on the runner.

Concrete example for Spark when executing as batch:

When FileIO is used with a specific number of shards, it uses the default
sharding function, which is a round-robin shard assignment with a random seed.
In a multistage pipeline, data between stages is held by the shuffle service
until the downstream stage requests it for further computation. If shuffle
results computed with this seeded shard function are lost (e.g. the shuffle
service fails because of a HW error), Spark will attempt to recover the data by
computing it again from the source data. Because the sharding is randomly
seeded, the recomputation can assign a different shard, and therefore a
different key, to the same element, which is how that element can end up in
more than one output file.
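
The effect can be illustrated with a small standalone sketch. This is plain
Java, not Beam's FileIO or WriteFiles code: the class and method names are
hypothetical, the random starting shard stands in for the "random seed"
described above, and the CRC32-based assignment is only an assumed example of
a deterministic alternative, shown for contrast.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.zip.CRC32;

// Illustrative only: a simulation of the behaviour described in the issue,
// not the Beam implementation.
public class ShardAssignmentSketch {

    // Round robin from a given starting shard (assumed non-negative).
    // In the scenario above the start is random, so each attempt
    // (original run, recomputation) may use a different one.
    static List<Integer> roundRobinFrom(List<String> elements, int shardCount, int startShard) {
        List<Integer> shards = new ArrayList<>();
        int next = startShard % shardCount;
        for (int i = 0; i < elements.size(); i++) {
            shards.add(next);
            next = (next + 1) % shardCount;
        }
        return shards;
    }

    // Hypothetical alternative: the shard depends only on the element's bytes,
    // so recomputing the same element always yields the same shard.
    static List<Integer> contentHashed(List<String> elements, int shardCount) {
        List<Integer> shards = new ArrayList<>();
        for (String element : elements) {
            CRC32 crc = new CRC32();
            crc.update(element.getBytes(StandardCharsets.UTF_8));
            // CRC32.getValue() is a non-negative long, so the modulo is too.
            shards.add((int) (crc.getValue() % shardCount));
        }
        return shards;
    }

    public static void main(String[] args) {
        List<String> elements = Arrays.asList("a", "b", "c", "d", "e");
        int shardCount = 4;

        // Two attempts over the same input, each with its own random start.
        int firstStart = ThreadLocalRandom.current().nextInt(shardCount);
        int retryStart = ThreadLocalRandom.current().nextInt(shardCount);
        System.out.println("round robin, first attempt: "
                + roundRobinFrom(elements, shardCount, firstStart));
        System.out.println("round robin, recomputation: "
                + roundRobinFrom(elements, shardCount, retryStart));

        // The content-based assignment does not change between attempts.
        System.out.println("content hashed, first attempt: "
                + contentHashed(elements, shardCount));
        System.out.println("content hashed, recomputation: "
                + contentHashed(elements, shardCount));
    }
}
```

Whenever the two random starts differ, the two round-robin lines differ for
every element, while the content-hashed lines are always identical. Whether a
content-based assignment is the right fix for FileIO is a separate question,
discussed in the thread linked below.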

More details are discussed in this thread:
https://lists.apache.org/thread.html/r5e91d1996479defbf5e896dca3cf237ee2d9b59396cb3c4edf619df1%40%3Cdev.beam.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
