Thanks, I looked into these options. The cleaner's periodic interval is set to 30 min by default, and the blocking option for shuffle - *spark.cleaner.referenceTracking.blocking.shuffle* - is set to false by default. What are the implications of setting it to true? Will it make the driver slower?
Thanks,
Alex

On Sun, Jul 21, 2019 at 9:06 AM Prathmesh Ranaut Gmail <prathmesh.ran...@gmail.com> wrote:

> This is the job of ContextCleaner. There are a few properties that you can
> tweak to see if that helps:
> spark.cleaner.periodicGC.interval
> spark.cleaner.referenceTracking
> spark.cleaner.referenceTracking.blocking.shuffle
>
> Regards
> Prathmesh Ranaut
>
> On Jul 21, 2019, at 11:31 AM, Alex Landa <metalo...@gmail.com> wrote:
>
> Hi,
>
> We are running a long-running Spark application (which executes lots of
> quick jobs using our scheduler) on a Spark standalone cluster, version 2.4.0.
> We see that old shuffle files (a week old, for example) are not deleted
> during the execution of the application, which leads to out-of-disk-space
> errors on the executors.
> If we re-deploy the application, the Spark cluster takes care of the
> cleaning and deletes the old shuffle data (since we have
> -Dspark.worker.cleanup.enabled=true in the worker config).
> I don't want to re-deploy our app every week or two, but to be able to
> configure Spark to clean old shuffle data (as it should).
>
> How can I configure Spark to delete old shuffle data during the lifetime
> of the application (not after)?
>
> Thanks,
> Alex
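For reference, the three ContextCleaner properties discussed above can be set at submission time rather than requiring a redeploy. A minimal sketch of what that might look like in spark-defaults.conf (the interval value here is illustrative, not a recommendation; the defaults noted in the comments are as stated in this thread):

```properties
# spark-defaults.conf (or pass each as a --conf flag to spark-submit)

# How often the driver triggers a GC to find unreachable shuffle/RDD
# references and clean them up (default: 30min)
spark.cleaner.periodicGC.interval                 15min

# Enables reference tracking so the cleaner can detect shuffles that are
# no longer reachable (default: true)
spark.cleaner.referenceTracking                   true

# If true, the cleaning thread blocks on shuffle cleanup requests
# (default: false)
spark.cleaner.referenceTracking.blocking.shuffle  false
```

Equivalently, each property can be passed on the command line, e.g. `spark-submit --conf spark.cleaner.periodicGC.interval=15min ...`.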