Jozef Vilcek created BEAM-12654:
-----------------------------------
Summary: FileIO can produce duplicates in output files
Key: BEAM-12654
URL: https://issues.apache.org/jira/browse/BEAM-12654
Project: Beam
Issue Type: Bug
Components: sdk-java-core
Reporter: Jozef Vilcek
FileIO can produce duplicates in output files - depending on a runner.
Concrete example for Spark when executing as batch:
When using FileIO with specific number of shards, it will use default sharding
function which is a round robin shard assignment with random seed. In
multistage pipeline, data between stages are hold by shuffle service until
downstream stage request it for further computations. If shuffle results
computed with this seeded shard function are lost - e.g. shuffle service fails
because of HW error - then Spark will attempt to recover data by computing them
again from source data. As a result of a random seed sharding, this will assign
different shard - and therefore key to the element.
More details are discussed in this thread:
https://lists.apache.org/thread.html/r5e91d1996479defbf5e896dca3cf237ee2d9b59396cb3c4edf619df1%40%3Cdev.beam.apache.org%3E
--
This message was sent by Atlassian Jira
(v8.3.4#803005)