Hi Folks, I am working with the Spark DataFrame API, where I am doing the following:
1) df = spark.sql("some sql on huge dataset").persist()
2) df1 = df.count()
3) df.repartition().write.mode().parquet("")

AFAIK, persist should be used after the count statement, if it needs to be used at all, since Spark is lazily evaluated: if I call any action, it will recompute the code above, and hence there is no use in persisting it before the action. Therefore, it should be something like the below, which should give better performance:

1) df = spark.sql("some sql on huge dataset")
2) df1 = df.count()
3) df.persist()
4) df.repartition().write.mode().parquet("")

So please help me understand how it should be done exactly, and why, if I am not correct.

Thanks,
Sid
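PS — to make the question concrete, here is a toy pure-Python sketch of the lazy-caching behavior I am asking about. This is not real Spark; `LazyDataset` is a hypothetical stand-in that models persist() as a lazy mark, with the first action afterwards populating the cache. It just counts how many times the expensive computation runs under each ordering:

```python
class LazyDataset:
    """Toy model of lazy evaluation plus caching (not real Spark)."""

    def __init__(self, compute):
        self._compute = compute    # deferred computation (like a query plan)
        self._persisted = False
        self._cache = None
        self.compute_calls = 0     # how many times the expensive work ran

    def persist(self):
        # Like Spark's persist(): lazy -- it only marks the dataset.
        self._persisted = True
        return self

    def _materialize(self):
        if self._persisted and self._cache is not None:
            return self._cache     # reuse cached result
        self.compute_calls += 1
        data = self._compute()
        if self._persisted:
            self._cache = data     # first action after persist() fills the cache
        return data

    def count(self):               # an action
        return len(self._materialize())

    def write(self):               # another action
        return list(self._materialize())


# Ordering A: persist() BEFORE the first action. count() computes once
# and fills the cache; write() then reuses the cache.
df = LazyDataset(lambda: list(range(5))).persist()
df.count()
df.write()
print(df.compute_calls)   # -> 1

# Ordering B: persist() AFTER count(). count() computes without caching,
# then write() computes again to fill the cache.
df2 = LazyDataset(lambda: list(range(5)))
df2.count()
df2.persist()
df2.write()
print(df2.compute_calls)  # -> 2
```

Under this model, persisting before the first action means one computation total, while persisting after count() means two — which is the crux of what I am trying to confirm for real Spark.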