Hi,

We are facing an issue when we convert an RDD to a Dataset and then do a repartition + write. We are using spot instances on k8s, which means executors can die at any moment. When they do during this phase, we very often see data duplication in the output.
Pseudo job code:

    val rdd = data.map(…)
    val ds = spark.createDataset(rdd, classEncoder)
    ds.repartition(N)
      .write
      .format("parquet")
      .mode("overwrite")
      .save(path)

If I kill an executor pod during the repartition stage, I can reproduce the issue. If I instead move the repartition onto the RDD rather than the Dataset, I cannot reproduce it. Is this a bug in Spark lineage when going from RDD -> DS/DF -> repartition and an executor drops? There is no randomness in the map function on the RDD, before you ask 😊

Thanks,
Erik
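For reference, a minimal sketch of the workaround described above, i.e. moving the repartition onto the RDD before the RDD -> Dataset conversion (this assumes the same `data`, `classEncoder`, `N`, and `path` as in the pseudocode, and passes the encoder as the implicit parameter of `createDataset`):

    // Repartition on the RDD lineage, before converting to a Dataset,
    // so the shuffle happens on the RDD side rather than the Dataset side.
    val rdd = data
      .map(…)            // same deterministic map function as before
      .repartition(N)    // moved from the Dataset to the RDD

    spark.createDataset(rdd)(classEncoder)
      .write
      .format("parquet")
      .mode("overwrite")
      .save(path)

With this ordering I cannot reproduce the duplication when killing an executor pod mid-shuffle.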