Our use case is as follows:
   We repartition six months' worth of data for each client on clientId and
recordcreationdate, so that the job writes one file per partition. The
output is partitioned by clientId and recordcreationdate.
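For reference, a minimal sketch of the loop described above, assuming a Parquet source, hypothetical input/output paths, and placeholder client ids (only clientId and recordcreationdate come from the description):

```scala
import org.apache.spark.sql.SparkSession

object RepartitionPerClient {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("repartition-per-client").getOrCreate()

    val clients = Seq("clientA", "clientB") // placeholder client ids

    for (client <- clients) {
      val df = spark.read.parquet(s"/input/$client") // assumed input layout
      // Shuffle so each (clientId, recordcreationdate) pair lands in one
      // task, then write one file per output partition.
      df.repartition(df("clientId"), df("recordcreationdate"))
        .write
        .partitionBy("clientId", "recordcreationdate")
        .mode("overwrite")
        .parquet("/output")
    }
    spark.stop()
  }
}
```

Each iteration of the loop leaves its shuffle files on local disk until the driver cleans them up, which matches the disk-growth behavior described below.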

The job fills up the disk after it processes, say, 30 tenants out of 50. I am
looking for a way to clear the shuffle files once the job finishes writing
to disk for one client, before it moves on to the next.

We process a client, or a group of clients (depending on data size), in one
go; the SparkSession is shared. We noticed that creating a new SparkSession
clears the disk, but a new SparkSession is not an option for us.
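For context: within a live SparkSession, shuffle files are removed by the driver's ContextCleaner once the corresponding shuffle objects are garbage collected on the driver, which is why they pile up while the session stays open. One commonly suggested mitigation (a sketch, not a guaranteed fix for this workload) is to lower `spark.cleaner.periodicGC.interval` from its 30-minute default so the cleaner triggers GC, and hence shuffle-file deletion, more often:

```shell
# Hypothetical spark-submit invocation; job jar name is a placeholder.
spark-submit \
  --conf spark.cleaner.periodicGC.interval=5min \
  repartition-job.jar
```

This only helps if the driver no longer holds references to the earlier clients' DataFrames when GC runs.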


