[
https://issues.apache.org/jira/browse/SPARK-25888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-25888:
------------------------------------
Assignee: (was: Apache Spark)
> Service requests for persist() blocks via external service after dynamic
> deallocation
> -------------------------------------------------------------------------------------
>
> Key: SPARK-25888
> URL: https://issues.apache.org/jira/browse/SPARK-25888
> Project: Spark
> Issue Type: New Feature
> Components: Block Manager, Shuffle, YARN
> Affects Versions: 2.3.2
> Reporter: Adam Kennedy
> Priority: Major
>
> Large and highly multi-tenant Spark on YARN clusters with diverse job
> execution often display terrible utilization rates (we have observed rates as
> low as 3-7% CPU at maximum container allocation, and even on a well-policed
> cluster 50% CPU utilization is not uncommon).
> As a sizing example, consider a scenario with 1,000 nodes, 50,000 cores, 250
> users and 50,000 runs of 1,000 distinct applications per week, predominantly
> Spark, with a mixture of ETL, ad hoc tasks and PySpark notebook jobs (no
> streaming).
> Utilization problems appear to be due in large part to persist() blocks (DISK
> or DISK+MEMORY) preventing dynamic deallocation of the executors that hold
> them.
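> For illustration, a minimal sketch of the pattern that causes the blocking
> (the persist() API and the config name below are real; "rdd" stands in for
> any previously built RDD):
>
>     import org.apache.spark.storage.StorageLevel
>
>     // A disk-backed persist() pins its blocks to the executors that wrote them.
>     val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK)
>     cached.count() // materialize the blocks
>
>     // Executors holding cached blocks are only deallocated after
>     // spark.dynamicAllocation.cachedExecutorIdleTimeout, which defaults to
>     // infinity, so in practice they stay pinned for the life of the job.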
> In situations where an external shuffle service is present (which is typical
> on clusters of this type) we already solve this for shuffle blocks by
> offloading their I/O handling to the external service, allowing dynamic
> deallocation to proceed.
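> For reference, that shuffle-side arrangement is purely a matter of
> configuration; both settings below are real:
>
>     import org.apache.spark.SparkConf
>
>     // With the external shuffle service enabled, shuffle files outlive the
>     // executors that wrote them, so dynamic allocation can reclaim executors
>     // between stages without losing shuffle data.
>     val conf = new SparkConf()
>       .set("spark.shuffle.service.enabled", "true") // NodeManager aux service on YARN
>       .set("spark.dynamicAllocation.enabled", "true")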
> Allowing executors to transfer persist() blocks to some external "shuffle"
> service in a similar manner would be an enormous win for Spark multi-tenancy,
> as it would limit deallocation blocking to MEMORY-only cache() scenarios.
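> To make the idea concrete, here is a purely hypothetical executor-side
> sketch. BlockId is Spark's real block identifier type, but the
> ExternalBlockService trait and registerPersistedBlock method are invented
> for illustration and do not exist in Spark today:
>
>     import java.io.File
>     import org.apache.spark.storage.BlockId
>
>     // HYPOTHETICAL service interface; not an existing Spark API.
>     trait ExternalBlockService {
>       def registerPersistedBlock(blockId: BlockId, path: String): Unit
>     }
>
>     // Hand a DISK-persisted block's file to the node-local external service
>     // before this executor is deallocated, mirroring how shuffle files are
>     // already served after the writing executor is gone.
>     def offloadDiskBlock(svc: ExternalBlockService, blockId: BlockId,
>         file: File): Unit = {
>       svc.registerPersistedBlock(blockId, file.getAbsolutePath)
>       // The driver's BlockManagerMaster would then be updated so readers
>       // locate the block via the external service, not this executor.
>     }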
> I'm not sure if I'm correct, but I seem to recall from the original external
> shuffle service commits that this may have been considered at the time, but
> that getting shuffle blocks moved to the external shuffle service was the
> first priority.
> With support for external persist() DISK blocks in place, we could then also
> handle deallocation for DISK+MEMORY blocks: the in-memory copy could first be
> dropped, demoting the block to DISK only, and the remaining disk block then
> transferred to the shuffle service.
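> A hypothetical sketch of that demotion step, reusing the offload sketch
> above; memoryStore and diskBlockManager mirror the BlockManager's real
> internal fields, but this composition is invented:
>
>     // HYPOTHETICAL: demote a DISK+MEMORY block to DISK only, then hand the
>     // remaining disk copy off exactly like a plain DISK block.
>     def demoteAndOffload(svc: ExternalBlockService, blockId: BlockId): Unit = {
>       if (memoryStore.contains(blockId)) {
>         memoryStore.remove(blockId) // drop the in-memory copy; disk copy remains
>       }
>       offloadDiskBlock(svc, blockId, diskBlockManager.getFile(blockId))
>     }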
> We have tried to resolve the persist() issue via extensive user training, but
> that has typically only allowed us to improve the worst offenders from around
> 10% utilization up to 40-60%, as the need for persist() is often legitimate
> and occurs during the middle stages of a job.
> In a healthy multi-tenant scenario, a large job might spool up to, say,
> 10,000 cores, persist() data, release executors across a long tail down to
> 100 cores, and then spool back up to 10,000 cores for the following stage
> without any impact on the persisted data.
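> That pattern looks roughly like the sketch below; input, expensiveTransform
> and downstreamTransform are placeholders for whatever the job actually does:
>
>     // Stage 1: wide work across many executors, then persist the result.
>     val mid = input.repartition(10000)
>       .map(expensiveTransform)
>       .persist(StorageLevel.DISK_ONLY)
>     mid.count() // materialize; the long tail of stragglers follows
>
>     // During the long tail most executors sit idle; with external persist()
>     // support their blocks could be handed off and the executors released.
>
>     // Later stages: the job spools back up and reads the persisted blocks.
>     val result = mid.map(downstreamTransform).count()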
> In an ideal world, if a new executor started up on a node on which blocks had
> been transferred to the shuffle service, the new executor might even be able
> to "recapture" control of those blocks (if that would help with performance
> in some way).
> And the behavior of gradually scaling up and down several times over the
> course of a job would not just improve utilization, but would also allow
> resources to be redistributed more easily to other jobs that start on the
> cluster during the long-tail periods, improving multi-tenancy and bringing us
> closer to optimal "envy free" YARN scheduling.