Hello All,

I often run into situations where I ask myself: should I write a mapPartitions
function on the RDD, or stay with the DataFrame API all the way (withColumn +
groupBy)? I am using PySpark 2.3 (Python 2.7). I understand we should use
DataFrames as much as possible, but at times it feels like the RDD functions
allow more flexible code. Could you please advise when to prefer one approach
over the other? (Keeping pandas UDFs in mind as well - which approach makes
more sense in which scenarios?)
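To make the comparison concrete, here is a rough sketch of the three styles I mean (the DataFrame `df` with columns "key" and "value" is just a made-up example; the pyspark imports are done lazily inside each function so the sketch parses without a Spark installation):

```python
def rdd_group_sum(df):
    # RDD route: drop down to the RDD API for per-record / per-partition control
    return (df.rdd
              .map(lambda row: (row["key"], row["value"]))
              .reduceByKey(lambda a, b: a + b))

def dataframe_group_sum(df):
    # DataFrame route: stays inside Catalyst/Tungsten, so the optimizer can help
    from pyspark.sql import functions as F  # imported lazily; assumes pyspark is available
    return df.groupBy("key").agg(F.sum("value").alias("value_sum"))

def pandas_grouped_sum(df):
    # Pandas UDF route (Spark 2.3+): per-group logic in pandas, Arrow-backed
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    @pandas_udf("key string, value_sum double", PandasUDFType.GROUPED_MAP)
    def sum_group(pdf):
        import pandas as pd
        return pd.DataFrame({"key": [pdf["key"].iloc[0]],
                             "value_sum": [float(pdf["value"].sum())]})

    return df.groupBy("key").apply(sum_group)
```

All three should produce the same per-key sums; the question is which one to reach for, and when.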

Also, how does this choice affect performance - that is, using DataFrames all
the way vs. an RDD mapPartitions function?

Another question that always arises is when to persist a DataFrame. Should we
repartition before a groupBy? And if so, will skipping the persist hurt
performance?
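For context, the pattern I have in mind looks roughly like this (again a sketch, not real code - `df` and the column "key" are assumptions, and the pyspark imports are lazy so the sketch parses without Spark installed):

```python
def grouped_twice(df):
    from pyspark import StorageLevel            # lazy import; assumes pyspark at call time
    from pyspark.sql import functions as F

    grouped = (df.repartition("key")            # hash-partition by the grouping key;
                                                # groupBy shuffles anyway, so this mainly
                                                # helps if the partitioning is reused
                 .groupBy("key")
                 .agg(F.count("*").alias("n"))
                 .persist(StorageLevel.MEMORY_AND_DISK))  # cache since we reuse it below

    total = grouped.agg(F.sum("n")).collect()   # first action materializes the cache
    top = grouped.orderBy(F.desc("n")).limit(10).collect()  # second action reuses it
    grouped.unpersist()
    return total, top
```

Without the persist, my understanding is that each action would recompute the groupBy from scratch - is that the right mental model here?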

Any help is much appreciated.

Thanks,
-Rishi
