[ 
https://issues.apache.org/jira/browse/SPARK-25888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kennedy updated SPARK-25888:
---------------------------------
    Environment: 
Large YARN cluster with 1,000 nodes, 50,000 cores and 250 users, with 
predominantly Spark workloads including a mixture of ETL, Ad Hoc and PySpark 
Notebook jobs.

No streaming Spark.

> Service requests for persist() blocks via external service after dynamic 
> deallocation
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-25888
>                 URL: https://issues.apache.org/jira/browse/SPARK-25888
>             Project: Spark
>          Issue Type: Improvement
>          Components: Block Manager, Shuffle, YARN
>    Affects Versions: 2.3.2
>         Environment: Large YARN cluster with 1,000 nodes, 50,000 cores and 
> 250 users, with predominantly Spark workloads including a mixture of ETL, Ad 
> Hoc and PySpark Notebook jobs.
> No streaming Spark.
>            Reporter: Adam Kennedy
>            Priority: Major
>
> Large and highly multi-tenant Spark on YARN clusters with large populations 
> and diverse job execution often display terrible utilization rates (we have 
> observed as low as 3-7% CPU at max container allocation) due in large part to 
> difficulties with persist() blocks (DISK or DISK+MEMORY) preventing dynamic 
> deallocation.
> In situations where an external shuffle service is present (which is typical 
> on clusters of this type) we already solve this for the shuffle block case by 
> offloading the IO handling of shuffle blocks to the external service, 
> allowing dynamic deallocation to proceed.
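> For reference, this is roughly how clusters of this type are configured today; 
> `spark.dynamicAllocation.cachedExecutorIdleTimeout` is the only existing knob for 
> executors holding cached blocks, and lowering it discards the persist()ed data 
> rather than preserving it (illustrative fragment, values are examples):

```
# spark-defaults.conf -- typical external shuffle service + dynamic allocation setup
spark.shuffle.service.enabled                       true
spark.dynamicAllocation.enabled                     true
# By default, executors holding cached blocks are never reclaimed; any finite
# value here trades away the cached data instead of offloading it.
spark.dynamicAllocation.cachedExecutorIdleTimeout   infinity
```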
> Allowing Executors to transfer persist() blocks to some external "shuffle" 
> service in a similar manner would be an enormous win for Spark multi-tenancy, 
> as it would limit deallocation blocking to MEMORY-only cache() cases.
> I may be misremembering, but I seem to recall from the original external 
> shuffle service commits that this may have been considered at the time, but 
> that moving shuffle blocks to the external shuffle service was the first 
> priority.
> With support for external persist() DISK blocks in place, we could also then 
> handle deallocation of DISK+MEMORY, as the memory instance could first be 
> dropped, changing the block to DISK only, and then further transferred to the 
> shuffle service.
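> As a rough sketch of that transition (the names below are hypothetical and do 
> not exist in Spark; they only illustrate the proposed decommission path): a 
> DISK+MEMORY block first drops its in-memory copy, becoming DISK only, and the 
> remaining on-disk copy is then handed to the external service:

```python
# Hypothetical sketch of the proposed block-state transitions; these classes
# and methods are illustrative, not existing Spark APIs.

MEMORY_AND_DISK, DISK_ONLY, EXTERNAL = "MEMORY_AND_DISK", "DISK_ONLY", "EXTERNAL"

class CachedBlock:
    def __init__(self, block_id, level):
        self.block_id = block_id
        self.level = level

    def drop_memory_copy(self):
        # Step 1: a DISK+MEMORY block degrades to DISK only by
        # evicting its in-memory instance.
        if self.level == MEMORY_AND_DISK:
            self.level = DISK_ONLY

    def transfer_to_external_service(self):
        # Step 2: only DISK-resident blocks can be handed to the external
        # service; a MEMORY-only block would still pin its executor.
        if self.level != DISK_ONLY:
            raise ValueError("only DISK-only blocks can be transferred")
        self.level = EXTERNAL

block = CachedBlock("rdd_42_0", MEMORY_AND_DISK)
block.drop_memory_copy()               # DISK+MEMORY -> DISK only
block.transfer_to_external_service()   # DISK only -> external service
# The executor holding this block can now be deallocated without data loss.
```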
> We have tried to resolve the persist() issue via extensive user training, but 
> that has typically only allowed us to improve utilization of the worst 
> offenders (10% utilization) up to around 40-60% utilization, as the need for 
> persist() is often legitimate and occurs during the middle stages of a job.
> In a healthy multi-tenant scenario, the job could spool up to say 10,000 
> cores, persist() data, release executors across a long tail (to say 100 
> cores) and then spool back up to 10,000 cores for the following stage without 
> impact on the persist() data.
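> That elasticity envelope is already expressible through the dynamic allocation 
> bounds; only the persist() blocks stand in the way of actually releasing down 
> to the floor (illustrative values, assuming 4 cores per executor):

```
spark.dynamicAllocation.minExecutors    25     # long-tail floor (~100 cores)
spark.dynamicAllocation.maxExecutors    2500   # burst ceiling (~10,000 cores)
```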
> In an ideal world, if a new executor started up on a node on which blocks 
> had been transferred to the shuffle service, the new executor might even be 
> able to "recapture" control of those blocks (if that would help with 
> performance in some way).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
