[
https://issues.apache.org/jira/browse/SPARK-44306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-44306:
-----------------------------------
Labels: pull-request-available (was: )
> Group FileStatus with few RPC calls within Yarn Client
> ------------------------------------------------------
>
> Key: SPARK-44306
> URL: https://issues.apache.org/jira/browse/SPARK-44306
> Project: Spark
> Issue Type: New Feature
> Components: Spark Submit
> Affects Versions: 0.9.2, 2.3.0, 3.5.0
> Reporter: SHU WANG
> Priority: Major
> Labels: pull-request-available
>
> It's inefficient to obtain *FileStatus* for each resource [one by
> one|https://github.com/apache/spark/blob/531ec8bddc8dd22ca39486dbdd31e62e989ddc15/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientDistributedCacheManager.scala#L71C1].
> In our company setting, we are running Spark with Hadoop Yarn and HDFS. We
> noticed the current behavior has two major drawbacks:
> # Since each *getFileStatus* call incurs a network round trip, the cumulative
> delay can be *large* and adds *uncertainty* to the overall Spark job runtime.
> We quantified this overhead in our cluster: the p50 overhead is around 10 s,
> p80 is 1 min, and p100 is up to 15 min. When HDFS is overloaded, the delays
> become more severe.
> # In our cluster, we issue nearly 100 million *getFileStatus* calls to HDFS
> daily. We noticed that most resources for each user come from the same HDFS
> directory (see our [engineering blog
> post|https://engineering.linkedin.com/blog/2023/reducing-apache-spark-application-dependencies-upload-by-99-]
> about why we took this approach). Therefore, we can reduce nearly
> 100 million *getFileStatus* calls to roughly 0.1 million *listStatus* calls
> daily, which also reduces load on the HDFS side.
> In short, a more efficient way to fetch the *FileStatus* for these resources
> is needed.
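The batching idea above can be sketched as follows. This is an illustrative mock, not Spark's actual Scala implementation in ClientDistributedCacheManager: all names here (MockNameNode, statuses_grouped, etc.) are hypothetical, and the mock only counts RPC calls to show why grouping per-file getFileStatus lookups into one listStatus per parent directory shrinks the call count.

```python
# Illustrative mock (hypothetical names, not Spark code): compare per-file
# getFileStatus RPCs against a single listStatus RPC per parent directory,
# the batching proposed in SPARK-44306.
from collections import defaultdict

class MockNameNode:
    """Counts RPC calls the way an HDFS NameNode would receive them."""
    def __init__(self, files_by_dir):
        self.files_by_dir = files_by_dir  # dir -> {filename: status}
        self.rpc_calls = 0

    def get_file_status(self, path):      # one RPC per file
        self.rpc_calls += 1
        d, name = path.rsplit("/", 1)
        return self.files_by_dir[d][name]

    def list_status(self, directory):     # one RPC per directory
        self.rpc_calls += 1
        return dict(self.files_by_dir[directory])

def statuses_one_by_one(nn, paths):
    """Current behavior: one getFileStatus RPC per resource."""
    return {p: nn.get_file_status(p) for p in paths}

def statuses_grouped(nn, paths):
    """Proposed behavior: group paths by parent dir, one listStatus per dir."""
    by_dir = defaultdict(list)
    for p in paths:
        d, name = p.rsplit("/", 1)
        by_dir[d].append((p, name))
    out = {}
    for d, entries in by_dir.items():
        listing = nn.list_status(d)       # single RPC covers every file in d
        for p, name in entries:
            out[p] = listing[name]
    return out

files = {"/user/a/libs": {f"dep{i}.jar": f"status{i}" for i in range(100)}}
paths = [f"/user/a/libs/dep{i}.jar" for i in range(100)]

nn1 = MockNameNode(files)
r1 = statuses_one_by_one(nn1, paths)

nn2 = MockNameNode(files)
r2 = statuses_grouped(nn2, paths)

assert r1 == r2
print(nn1.rpc_calls, nn2.rpc_calls)       # → 100 1
```

With 100 resources in one directory, the grouped variant returns identical statuses with 1 RPC instead of 100, which is the same ratio the issue describes at cluster scale (100M getFileStatus calls vs. 0.1M listStatus calls).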
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]