[ 
https://issues.apache.org/jira/browse/SPARK-25888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kennedy updated SPARK-25888:
---------------------------------
    Description: 
Large and highly multi-tenant Spark on YARN clusters with diverse job execution 
often display terrible utilization rates (we have observed rates as low as 3-7% 
CPU at maximum container allocation, and even a well-policed cluster commonly 
sits at only around 50% CPU utilization).

As a sizing example, one of our larger instances has 1,000 nodes, 50,000 cores, 
250 users and 1,000 distinct applications run per week, predominantly Spark, 
including a mixture of ETL, ad hoc tasks and PySpark notebook jobs (no 
streaming).

Utilization problems appear to be due in large part to persist() blocks (DISK 
or DISK+MEMORY storage levels) preventing dynamic deallocation of the executors 
that hold them.
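
For illustration, here is a minimal PySpark sketch (hypothetical job, assumed 
settings, nothing here is from this ticket) of the pattern that causes the 
pinning: once data is persisted at a DISK-backed storage level, the executors 
holding those blocks are only released by dynamic allocation after 
spark.dynamicAllocation.cachedExecutorIdleTimeout, which defaults to infinity.

{code:python}
# Sketch only: a persist() at a DISK-backed storage level pins blocks to
# executors and so prevents dynamic deallocation until the data is
# unpersisted or the (by default infinite) cachedExecutorIdleTimeout expires.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("persist-pins-executors-sketch")  # hypothetical app name
    # Without a finite value here, executors holding cached/persisted
    # blocks are never released by dynamic allocation.
    .config("spark.dynamicAllocation.cachedExecutorIdleTimeout", "30min")
    .getOrCreate()
)

df = spark.range(0, 10 * 1000 * 1000)
df.persist(StorageLevel.MEMORY_AND_DISK)  # or StorageLevel.DISK_ONLY
df.count()  # materialize the persisted blocks on the executors

# ... middle stages that reuse df ...

df.unpersist()  # only now can the holding executors be deallocated
{code}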

In situations where an external shuffle service is present (which is typical on 
clusters of this type), we already solve this for the shuffle block case by 
offloading the I/O handling of shuffle blocks to the external service, allowing 
dynamic deallocation to proceed.
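
For reference, a minimal sketch of the application-side settings that pair the 
external shuffle service with dynamic allocation (the application name is made 
up; this also assumes the spark_shuffle auxiliary service is already enabled on 
the YARN NodeManagers via yarn-site.xml):

{code:python}
# Sketch of the app-side configuration that lets shuffle blocks be served
# by the external shuffle service, so the executors that wrote them can be
# deallocated while the blocks remain readable.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("external-shuffle-service-sketch")  # hypothetical app name
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .getOrCreate()
)
{code}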

Allowing Executors to transfer persist() blocks to some external "shuffle" 
service in a similar manner would be an enormous win for Spark multi-tenancy, 
as it would limit the deallocation-blocking cases to memory-only cache() usage.

I may be misremembering, but I seem to recall from the original external 
shuffle service commits that this may have been considered at the time, with 
getting shuffle blocks moved to the external shuffle service being the first 
priority.

With support for external persist() DISK blocks in place, we could then also 
handle deallocation of DISK+MEMORY blocks: the in-memory copy could first be 
dropped, downgrading the block to DISK only, and the remaining disk block could 
then be transferred to the shuffle service.
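
To make that downgrade step concrete, the storage level flags already encode 
it: MEMORY_AND_DISK is DISK_ONLY plus an in-memory copy, so dropping the memory 
copy leaves a plain disk block (a small PySpark illustration, not from this 
ticket).

{code:python}
# StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication)
from pyspark import StorageLevel

# Both levels keep a disk copy; only the in-memory flag differs, so a
# MEMORY_AND_DISK block minus its memory copy is effectively DISK_ONLY
# and could then be handed to an external service as proposed above.
print(StorageLevel.MEMORY_AND_DISK)  # StorageLevel(True, True, False, False, 1)
print(StorageLevel.DISK_ONLY)        # StorageLevel(True, False, False, False, 1)
{code}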

We have tried to resolve the persist() issue via extensive user training, but 
that has typically only allowed us to improve the worst offenders from around 
10% utilization up to around 40-60%, as the need for persist() is often 
legitimate and occurs during the middle stages of a job.

In a healthy multi-tenant scenario, a job could spool up to, say, 10,000 cores, 
persist() data, release executors across a long tail (down to, say, 100 cores) 
and then spool back up to 10,000 cores for the following stage without any 
impact on the persist() data.
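
The spool-up/spool-down itself is just dynamic allocation bounds, sketched 
below with the numbers above and an assumed 5 cores per executor; the missing 
piece this ticket asks for is that the release phase not be blocked by 
persisted blocks.

{code:python}
# Sketch of dynamic allocation bounds for the ~100-core to ~10,000-core
# spool pattern described above. Executor sizing (5 cores) is an assumption
# for illustration only.
from pyspark.sql import SparkSession

CORES_PER_EXECUTOR = 5  # assumed sizing

spark = (
    SparkSession.builder
    .appName("spool-up-down-sketch")  # hypothetical app name
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.executor.cores", str(CORES_PER_EXECUTOR))
    .config("spark.dynamicAllocation.minExecutors", str(100 // CORES_PER_EXECUTOR))
    .config("spark.dynamicAllocation.maxExecutors", str(10000 // CORES_PER_EXECUTOR))
    .getOrCreate()
)
{code}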

In an ideal world, if a new executor started up on a node where blocks had been 
transferred to the shuffle service, the new executor might even be able to 
"recapture" control of those blocks (if that would help with performance in 
some way).

  was:
Large and highly multi-tenant Spark on YARN clusters with large populations and 
diverse job execution often display terrible utilization rates (we have 
observed rates as low as 3-7% CPU at max container allocation), due in large 
part to difficulties with persist() blocks (DISK or DISK+MEMORY) preventing 
dynamic deallocation.

As a sizing example, one of our larger instances has 1,000 nodes, 50,000 cores, 
250 users and 1,000 distinct applications run per week, with predominantly 
Spark workloads including a mixture of ETL, ad hoc and PySpark notebook jobs, 
plus Hive and some MapReduce (no streaming Spark).

In situations where an external shuffle service is present (which is typical on 
clusters of this type), we already solve this for the shuffle block case by 
offloading the I/O handling of shuffle blocks to the external service, allowing 
dynamic deallocation to proceed.

Allowing Executors to transfer persist() blocks to some external "shuffle" 
service in a similar manner would be an enormous win for Spark multi-tenancy, 
as it would limit the deallocation-blocking cases to memory-only cache() usage.

I may be misremembering, but I seem to recall from the original external 
shuffle service commits that this may have been considered at the time, with 
getting shuffle blocks moved to the external shuffle service being the first 
priority.

With support for external persist() DISK blocks in place, we could then also 
handle deallocation of DISK+MEMORY blocks: the in-memory copy could first be 
dropped, downgrading the block to DISK only, and the remaining disk block could 
then be transferred to the shuffle service.

We have tried to resolve the persist() issue via extensive user training, but 
that has typically only allowed us to improve the worst offenders from around 
10% utilization up to around 40-60%, as the need for persist() is often 
legitimate and occurs during the middle stages of a job.

In a healthy multi-tenant scenario, a job could spool up to, say, 10,000 cores, 
persist() data, release executors across a long tail (down to, say, 100 cores) 
and then spool back up to 10,000 cores for the following stage without any 
impact on the persist() data.

In an ideal world, if a new executor started up on a node where blocks had been 
transferred to the shuffle service, the new executor might even be able to 
"recapture" control of those blocks (if that would help with performance in 
some way).


> Service requests for persist() blocks via external service after dynamic 
> deallocation
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-25888
>                 URL: https://issues.apache.org/jira/browse/SPARK-25888
>             Project: Spark
>          Issue Type: Improvement
>          Components: Block Manager, Shuffle, YARN
>    Affects Versions: 2.3.2
>            Reporter: Adam Kennedy
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
