Hi All,

I have a DataFrame of size 2.7 TB (Parquet) which I need to partition by
date, but the Spark program below keeps failing with a
FileAlreadyExistsException:

df = spark.read.parquet(INPUT_PATH)
df.repartition('date_field').write.partitionBy('date_field').mode('overwrite').parquet(PATH)

I did notice that a couple of tasks failed; presumably Spark retried them,
and the new attempts tried to write into the same .staging directory?
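
In case it helps, here is a minimal sketch of what I was planning to try
next. Disabling speculation is just a guess on my part, aimed at ruling out
duplicate (speculative) task attempts racing to write the same .staging
files:

from pyspark.sql import SparkSession

# Guess: turn off speculative execution so duplicate attempts of the
# same task can't collide on the same .staging output files.
spark = (SparkSession.builder
    .config('spark.speculation', 'false')
    .getOrCreate())

df = spark.read.parquet(INPUT_PATH)
df.repartition('date_field').write.partitionBy('date_field').mode('overwrite').parquet(PATH)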

-- 
Regards,

Rishi Shah
