Adam Kennedy updated SPARK-25888:
---------------------------------
Environment: Large YARN cluster with 1,000 nodes, 50,000 cores, and 250 users, running predominantly Spark workloads: a mixture of ETL, ad hoc, and PySpark notebook jobs. No streaming Spark.

> Service requests for persist() blocks via external service after dynamic
> deallocation
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-25888
>                 URL: https://issues.apache.org/jira/browse/SPARK-25888
>             Project: Spark
>          Issue Type: Improvement
>          Components: Block Manager, Shuffle, YARN
>    Affects Versions: 2.3.2
>         Environment: Large YARN cluster with 1,000 nodes, 50,000 cores, and
> 250 users, running predominantly Spark workloads: a mixture of ETL, ad hoc,
> and PySpark notebook jobs.
> No streaming Spark.
>            Reporter: Adam Kennedy
>            Priority: Major
>
> Large and highly multi-tenant Spark on YARN clusters with large user
> populations and diverse job execution often show terrible utilization rates
> (we have observed as low as 3-7% CPU at maximum container allocation), in
> large part because persist() blocks (DISK or DISK+MEMORY) prevent dynamic
> deallocation.
>
> Where an external shuffle service is present (which is typical on clusters
> of this type), we already solve this for the shuffle block case by
> offloading the I/O handling of shuffle blocks to the external service,
> allowing dynamic deallocation to proceed.
>
> Allowing executors to transfer persist() blocks to a similar external
> "shuffle" service would be an enormous win for Spark multi-tenancy, as it
> would limit deallocation-blocking scenarios to MEMORY-only cache()
> scenarios.
>
> I may be misremembering, but I seem to recall from the original external
> shuffle service commits that this may have been considered at the time;
> moving shuffle blocks to the external shuffle service was simply the first
> priority.
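For context, the deallocation behavior described above is visible in the standard dynamic allocation settings: executors holding cached blocks are governed by a separate idle timeout, `spark.dynamicAllocation.cachedExecutorIdleTimeout`, which defaults to infinity, so they are never released unless the operator explicitly opts into losing the cache. A minimal spark-defaults.conf sketch of the current state of affairs (the timeout values are illustrative, not recommendations):

```properties
# Offload shuffle block serving to the external (YARN auxiliary) service.
# This is what already allows executors holding only shuffle blocks to be
# deallocated safely.
spark.shuffle.service.enabled            true
spark.dynamicAllocation.enabled          true

# Idle executors with no cached blocks are released after this timeout.
spark.dynamicAllocation.executorIdleTimeout        60s

# Executors holding persist()/cache() blocks use this separate timeout,
# which defaults to infinity -- the deallocation blocker described above.
# Lowering it trades cache loss for utilization; an external persist()
# service would remove that trade-off for DISK-backed storage levels.
spark.dynamicAllocation.cachedExecutorIdleTimeout  600s
```

In other words, today the only knob for reclaiming cache-holding executors discards the cached blocks outright, which is exactly the gap this proposal addresses for DISK-backed levels.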
> With support for external persist() DISK blocks in place, we could then
> also handle deallocation of DISK+MEMORY blocks: the in-memory copy could
> first be dropped, converting the block to DISK only, and the block then
> transferred to the shuffle service.
>
> We have tried to resolve the persist() issue through extensive user
> training, but that has typically only improved the worst offenders (10%
> utilization) to around 40-60% utilization, as the need for persist() is
> often legitimate and occurs during the middle stages of a job.
>
> In a healthy multi-tenant scenario, a job could spool up to, say, 10,000
> cores, persist() data, release executors across a long tail (down to, say,
> 100 cores), and then spool back up to 10,000 cores for the following stage
> without impact on the persist() data.
>
> Ideally, if a new executor started up on a node whose blocks had been
> transferred to the shuffle service, the new executor might even be able to
> "recapture" control of those blocks (if that would help performance in
> some way).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)