Hello All,

I often run into situations where I ask myself: should I write a mapPartitions function on an RDD, or stay with the DataFrame API all the way (withColumn + groupBy)? I am using PySpark 2.3 (Python 2.7). I understand we should use DataFrames as much as possible, but at times an RDD function feels like it allows more flexible code. Could you please advise when to prefer one approach over the other? Keeping pandas UDFs in mind, which approach makes more sense in which scenarios?
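To make the question concrete, here is a plain-Python sketch of the kind of per-partition logic I mean (no Spark needed to run it; in PySpark this would be `rdd.mapPartitions(process_partition)`, and `partitions` here is just a stand-in list of lists):

```python
# Hypothetical sketch of the RDD mapPartitions pattern.
# The appeal is that per-partition setup runs once per partition,
# not once per row -- hard to express with DataFrame column expressions.

def process_partition(rows):
    # Stand-in for costly one-time setup (e.g. opening a connection,
    # loading a model) that mapPartitions lets you amortize.
    expensive_state = {"multiplier": 2}
    for row in rows:
        yield row * expensive_state["multiplier"]

partitions = [[1, 2, 3], [4, 5]]  # stand-in for an RDD's partitions
result = [x for part in partitions for x in process_partition(iter(part))]
print(result)  # [2, 4, 6, 8, 10]
```

The DataFrame alternative would express the same transformation with built-in column functions or a pandas UDF, keeping the work inside Catalyst/Tungsten instead of plain Python iterators.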
Also, how does the choice affect performance, i.e. staying with DataFrames all the way vs. an RDD mapPartitions function? Another question that always arises is when to persist a DataFrame. Should we repartition before a groupBy? If so, will skipping the persist hurt performance? Any help is much appreciated.

Thanks,
-Rishi