Hi Folks, I am working with the Spark DataFrame API, where I am doing the following:
1) df = spark.sql("some sql on huge dataset").persist()
2) df1 = df.count()
3) df.repartition().write.mode().parquet("")

AFAIK, persist should be used after the count statement, if it needs to be used at all, since Spark is lazily evaluated: if I call any action, it will recompute the code above, and hence there is no use in persisting it before the action. Therefore, it should be something like the below, which should give better performance:

1) df = spark.sql("some sql on huge dataset")
2) df1 = df.count()
3) df.persist()
4) df.repartition().write.mode().parquet("")

So please help me understand how it should be done exactly, and why, if I am not correct.

Thanks,
Sid
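PS — to make the question concrete, here is a toy pure-Python sketch of the lazy-caching behavior I am asking about. This is not real Spark; `LazyDataset` is a hypothetical stand-in that models persist() as a lazy mark, with the first action afterwards populating the cache. It just counts how many times the expensive computation runs under each ordering:

```python
class LazyDataset:
    """Toy model of lazy evaluation plus caching (not real Spark)."""

    def __init__(self, compute):
        self._compute = compute    # deferred computation (like a query plan)
        self._persisted = False
        self._cache = None
        self.compute_calls = 0     # how many times the expensive work ran

    def persist(self):
        # Like Spark's persist(): lazy -- it only marks the dataset.
        self._persisted = True
        return self

    def _materialize(self):
        if self._persisted and self._cache is not None:
            return self._cache     # reuse cached result
        self.compute_calls += 1
        data = self._compute()
        if self._persisted:
            self._cache = data     # first action after persist() fills the cache
        return data

    def count(self):               # an action
        return len(self._materialize())

    def write(self):               # another action
        return list(self._materialize())


# Ordering A: persist() BEFORE the first action. count() computes once
# and fills the cache; write() then reuses the cache.
df = LazyDataset(lambda: list(range(5))).persist()
df.count()
df.write()
print(df.compute_calls)   # -> 1

# Ordering B: persist() AFTER count(). count() computes without caching,
# then write() computes again to fill the cache.
df2 = LazyDataset(lambda: list(range(5)))
df2.count()
df2.persist()
df2.write()
print(df2.compute_calls)  # -> 2
```

Under this model, persisting before the first action means one computation total, while persisting after count() means two — which is the crux of what I am trying to confirm for real Spark.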