Hi,

We are hitting an issue when we convert an RDD to a Dataset and then
repartition + write. We run on spot instances on k8s, which means executors
can die at any moment. When one dies during this phase, we very often see
duplicated data in the output.

Pseudo job code:

val rdd = data.map(…)  // deterministic map, no randomness
val ds = spark.createDataset(rdd)(classEncoder)

ds.repartition(N)
  .write
  .format("parquet")
  .mode("overwrite")
  .save(path)

If I kill an executor pod during the repartition stage, I can reproduce the
issue. If I instead move the repartition onto the RDD, before the Dataset
conversion, I cannot reproduce it.
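For clarity, this is roughly the variant that does not reproduce for us (same
map and same classEncoder, only the repartition moved to the RDD side):

val rdd = data.map(…).repartition(N)  // repartition before Dataset conversion
val ds = spark.createDataset(rdd)(classEncoder)

ds.write
  .format("parquet")
  .mode("overwrite")
  .save(path)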

Is this a bug in Spark lineage when going from RDD -> DS/DF -> repartition and
an executor drops? Before you ask: there is no randomness in the map function
on the RDD 😊

Thanks,
Erik
