Thanks, I looked into these options. The cleaner's periodic interval is set to 30 min by default, and the blocking option for shuffle - *spark.cleaner.referenceTracking.blocking.shuffle* - is set to false by default. What are the implications of setting it to true? Will it make the driver slower?
Thanks,
Alex

On Sun, Jul 21, 2019 at 9:06 AM Prathmesh Ranaut Gmail <prathmesh.ran...@gmail.com> wrote:

> This is the job of ContextCleaner. There are a few properties that you can
> tweak to see if that helps:
> spark.cleaner.periodicGC.interval
> spark.cleaner.referenceTracking
> spark.cleaner.referenceTracking.blocking.shuffle
>
> Regards
> Prathmesh Ranaut
>
> On Jul 21, 2019, at 11:31 AM, Alex Landa <metalo...@gmail.com> wrote:
>
> Hi,
>
> We are running a long-running Spark application (which executes lots of
> quick jobs using our scheduler) on a Spark standalone cluster, version 2.4.0.
> We see that old shuffle files (a week old, for example) are not deleted
> during the execution of the application, which leads to out-of-disk-space
> errors on the executors.
> If we re-deploy the application, the Spark cluster takes care of the
> cleaning and deletes the old shuffle data (since we have
> -Dspark.worker.cleanup.enabled=true in the worker config).
> I don't want to re-deploy our app every week or two, but to be able to
> configure Spark to clean old shuffle data (as it should).
>
> How can I configure Spark to delete old shuffle data during the lifetime
> of the application (not after)?
>
> Thanks,
> Alex
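For reference, the three ContextCleaner properties discussed above can be set at submission time rather than requiring a redeploy. A minimal sketch of what that might look like in spark-defaults.conf (the interval value here is illustrative, not a recommendation; the defaults noted in the comments are as stated in this thread):

```properties
# spark-defaults.conf (or pass each as a --conf flag to spark-submit)

# How often the driver triggers a GC to find unreachable shuffle/RDD
# references and clean them up (default: 30min)
spark.cleaner.periodicGC.interval                 15min

# Enables reference tracking so the cleaner can detect shuffles that are
# no longer reachable (default: true)
spark.cleaner.referenceTracking                   true

# If true, the cleaning thread blocks on shuffle cleanup requests
# (default: false)
spark.cleaner.referenceTracking.blocking.shuffle  false
```

Equivalently, each property can be passed on the command line, e.g. `spark-submit --conf spark.cleaner.periodicGC.interval=15min ...`.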