Re: Re-create SparkContext of SparkSession inside long-lived Spark app

2024-02-17 Thread Jörn Franke
You can try shuffling to S3 using the Cloud Shuffle Storage Plugin for S3
(https://aws.amazon.com/blogs/big-data/introducing-the-cloud-shuffle-storage-plugin-for-apache-spark/)
- the plugin's performance is sufficient for many Spark jobs, and it also
works on EMR. Then you can use S3 lifecycle policies to clean up/expire
objects older than one day
(https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html)
- this also cleans up files left behind by crashed Spark jobs.
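
A minimal sketch of how the plugin might be wired up (the plugin class
com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin, the config keys, and the
bucket/prefix below are assumptions taken from the linked AWS post - verify
them against the post and your EMR release; in practice they are usually
passed via spark-submit or EMR configuration rather than set in code):

import org.apache.spark.sql.SparkSession

// Sketch: write shuffle blocks to S3 via the Cloud Shuffle Storage Plugin
// instead of local disk. Class name and keys per the linked AWS post
// (assumption - verify for your environment).
val spark = SparkSession.builder()
  .appName("shuffle-to-s3")
  .config("spark.shuffle.sort.io.plugin.class",
    "com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin")
  // Hypothetical bucket/prefix; put an S3 lifecycle rule on this prefix
  // (e.g. expire after 1 day) so leftovers from crashed jobs are removed.
  .config("spark.shuffle.storage.path", "s3://my-bucket/spark-shuffle/")
  .getOrCreate()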

For shuffle on local disk you do not have many choices, as you mentioned. I
would, however, avoid a long-lived app that loops - that never works well on
Spark (it is designed for batch jobs that eventually stop). Maybe you can
simply trigger a new job whenever a new file arrives (S3 events?).

> On 18.02.2024 at 00:39, Saha, Daniel wrote:
> 
> 
> Hi,
>  
> Background: I am running into executor disk space issues when running a
> long-lived Spark 3.3 app with YARN on AWS EMR. The app performs back-to-back
> Spark jobs in a sequential loop, with each iteration performing 100 GB+
> shuffles. The files taking up the space are related to shuffle blocks [1].
> Disk is only cleared when restarting the YARN app. For all intents and
> purposes, each job is independent, so once a job/iteration is complete, there
> is no need to retain its shuffle files. I want to try stopping and
> recreating the SparkContext between loop iterations/jobs to indicate to
> Spark's DiskBlockManager that these intermediate results are no longer
> needed [2].
>  
> Questions:
> - Are there better ways to remove/clean the directory containing these old,
> no-longer-used shuffle results (aside from cron or restarting the YARN app)?
> - How can I recreate the SparkContext within a single application? I see no
> methods on SparkSession for doing this, and each new SparkSession re-uses
> the existing SparkContext. After stopping the SparkContext, SparkSession
> does not re-create a new one. Further, creating a new SparkSession via its
> constructor and passing in a new SparkContext is not allowed, as the
> constructor is private.
>  
> Thanks
> Daniel
>  
> [1] 
> /mnt/yarn/usercache/hadoop/appcache/application_1706835946137_0110/blockmgr-eda47882-56d6-4248-8e30-a959ddb912c5
> [2] https://stackoverflow.com/a/38791921


Re: Re-create SparkContext of SparkSession inside long-lived Spark app

2024-02-17 Thread Adam Binford
If you're using dynamic allocation, it could be caused by executors holding
shuffle data being deallocated before the shuffle is cleaned up. Once that
happens, those shuffle files never get cleaned up until the YARN application
ends. This was a big issue for us, so I added support for deleting shuffle
data via the shuffle service for deallocated executors; it landed in Spark
3.3 but is disabled by default. See
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/package.scala#L698

spark.shuffle.service.removeShuffle
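
A hedged sketch of turning this on (spark.dynamicAllocation.enabled,
spark.shuffle.service.enabled, and spark.shuffle.service.removeShuffle are
real Spark properties, but whether you need the external shuffle service and
which values make sense depends on your YARN/EMR setup - verify before
relying on it):

import org.apache.spark.sql.SparkSession

// Sketch: with dynamic allocation plus the external shuffle service, let the
// shuffle service delete shuffle files of deallocated executors once the
// shuffle itself is garbage collected (Spark 3.3+, off by default).
val spark = SparkSession.builder()
  .appName("remove-shuffle-on-dealloc")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")
  .config("spark.shuffle.service.removeShuffle", "true")
  .getOrCreate()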

If you're not using dynamic allocation then I'm not sure; shuffle data
should be deleted once it's no longer needed (through the garbage-collection
mechanisms that track references to the shuffle). Maybe just make sure any
variables referencing the earlier DataFrames go out of scope.
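
For example, one way to let that happen is to keep each iteration's
DataFrames local to a method, so nothing references the shuffle once the
method returns (a sketch with hypothetical paths and column names; actual
cleanup still depends on the driver's JVM garbage collection triggering
Spark's ContextCleaner):

import org.apache.spark.sql.SparkSession

// Sketch: per-iteration work scoped to a method so its DataFrames (and their
// shuffle dependencies) become unreachable once the method returns.
def runIteration(spark: SparkSession, inPath: String, outPath: String): Unit = {
  val df = spark.read.parquet(inPath)
  df.groupBy("key").count()                  // the 100 GB+ shuffle happens here
    .write.mode("overwrite").parquet(outPath)
  // df goes out of scope here; avoid keeping long-lived references to it
}

val spark = SparkSession.builder().getOrCreate()
for (p <- Seq("s3://my-bucket/in/a", "s3://my-bucket/in/b")) { // hypothetical inputs
  runIteration(spark, p, p + "_out")
  System.gc() // optional nudge so unreferenced shuffles get cleaned up sooner
}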

Adam

On Sat, Feb 17, 2024 at 6:40 PM Saha, Daniel wrote:



-- 
Adam Binford