Our use case is as follows: we repartition six months' worth of data for each client on clientId and recordCreationDate, so that Spark writes one file per partition. The output is partitioned on the same two columns, client and recordCreationDate.
The job fills up the disk after it has processed roughly 30 of 50 tenants. I am looking for a way to clear the shuffle files once the job finishes writing a client's output to disk, before it moves on to the next client. We process one client, or a group of clients depending on data size, per pass, and the SparkSession is shared across passes. We noticed that creating a new SparkSession clears the shuffle files from disk, but a new SparkSession is not an option for us.