Re: Cache after filter Vs Writing back to HDFS

2015-09-22 Thread Akhil Das
Instead of .map you can try doing a .mapPartitions and see the performance.

Thanks
Best Regards

On Fri, Sep 18, 2015 at 2:47 AM, Gavin Yue wrote:
> For a large dataset, I want to filter out something and then do the
> computing intensive work.
>
> What I am doing now:
>
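
A minimal sketch of the .map versus .mapPartitions suggestion, assuming the Spark RDD API in Scala and a local master. The ExpensiveHelper class, the even-number filter, and the input range are hypothetical stand-ins, not taken from the thread. The usual reason .mapPartitions can help is that per-task setup is paid once per partition instead of once per element; whether it helps here depends on what the compute step actually does.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapPartitionsSketch {
  // Hypothetical stand-in for something costly to construct
  // (a parser, a model, a connection); not from the original thread.
  class ExpensiveHelper {
    def compute(x: Int): Int = x * x
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MapPartitionsSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Placeholder data and filter rule (assumptions for illustration).
      val data = sc.parallelize(1 to 1000, numSlices = 4)
      val filtered = data.filter(_ % 2 == 0)

      // map: the function runs once per element, so the helper
      // is constructed once per element.
      val viaMap = filtered.map(x => new ExpensiveHelper().compute(x))

      // mapPartitions: the function runs once per partition and receives
      // an iterator, so the helper is constructed once per partition.
      val viaMapPartitions = filtered.mapPartitions { iter =>
        val helper = new ExpensiveHelper()
        iter.map(x => helper.compute(x))
      }

      // Both variants should yield the same values.
      println(viaMap.reduce(_ + _) == viaMapPartitions.reduce(_ + _))
    } finally {
      sc.stop()
    }
  }
}
```

If the compute function has no setup cost to amortize, the two forms are likely to perform about the same.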

Cache after filter Vs Writing back to HDFS

2015-09-17 Thread Gavin Yue
For a large dataset, I want to filter out something and then do the computing-intensive work.

What I am doing now:

    Data.filter(somerules).cache()
    Data.count()
    Data.map(timeintensivecompute)

But this sometimes takes an unusually long time due to cache misses and recalculation. So I changed to
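
A note on the cache-after-filter pattern described in this message, with a minimal sketch assuming the Spark RDD API in Scala. Taken literally, the snippet discards the result of filter(...).cache(), so the following count and map would run on the original, unfiltered and uncached Data; the poster may simply have abbreviated. Separately, cache() on an RDD uses the MEMORY_ONLY storage level, where partitions that do not fit in memory are recomputed when needed, which matches the recalculation described. The sketch keeps a reference to the filtered RDD and uses MEMORY_AND_DISK so such partitions spill to disk instead. The input data, filter rule, and compute step are placeholders, not the poster's.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheAfterFilterSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CacheAfterFilterSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Placeholder for the large input dataset (assumption).
      val data = sc.parallelize(1 to 1000000)

      // Keep the filtered RDD in a val; RDDs are immutable, so calling
      // filter/cache without using the result leaves `data` unchanged.
      // MEMORY_AND_DISK spills partitions that do not fit in memory
      // to disk rather than recomputing them later.
      val filtered = data
        .filter(_ % 10 == 0) // stand-in for `somerules`
        .persist(StorageLevel.MEMORY_AND_DISK)

      // An action materializes the persisted partitions.
      val kept = filtered.count()

      // Stand-in for `timeintensivecompute`; reads the persisted data.
      val results = filtered.map(x => x.toLong * x)

      println(s"kept=$kept, processed=${results.count()}")

      // Release the storage once the persisted data is no longer needed.
      filtered.unpersist()
    } finally {
      sc.stop()
    }
  }
}
```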