[
https://issues.apache.org/jira/browse/SPARK-27232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated SPARK-27232:
----------------------------------
Affects Version/s: 3.0.0 (was: 2.4.0)
> Ignore file locality in InMemoryFileIndex if spark.locality.wait is set to
> --------------------------------------------------------------------------
>
> Key: SPARK-27232
> URL: https://issues.apache.org/jira/browse/SPARK-27232
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: EdisonWang
> Priority: Minor
>
> `InMemoryFileIndex` requests file block location information so that
> `TaskSetManager` can schedule tasks with data locality.
> This is usually time-consuming. For example, in our production environment
> there are 24 partitions, with 149925 files in total and 83TB in size. It takes
> about 10 minutes to request file block locations before a Spark job is
> submitted. Even after I set
> `spark.sql.sources.parallelPartitionDiscovery.threshold` to 24 to parallelize
> the request, it still takes 2 minutes.
> This is wasted work if we don't care about file locality (for example, when
> storage and computation are separate).
> So there should be a conf to control whether we send
> `getFileBlockLocations` requests to the HDFS NameNode. If the user sets
> `spark.locality.wait` to 0, file block location information is meaningless.
> In this PR, if `spark.locality.wait` is set to 0, Spark no longer requests
> file location information, which saves several seconds to minutes.
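
The check described above can be sketched in a few lines. This is only an illustration of the decision, not the code from the PR: the helper names `parse_time_ms` and `should_fetch_block_locations` are hypothetical, the configuration is modeled as a plain dict, and the sketch assumes that a bare number in `spark.locality.wait` means milliseconds and that an unset value falls back to Spark's documented default of `3s`.

```python
import re

# Multipliers to milliseconds for common time suffixes.
# Assumption: a value with no suffix is read as milliseconds.
_UNIT_MS = {
    "": 1,
    "us": 0.001,
    "ms": 1,
    "s": 1_000,
    "m": 60_000,
    "min": 60_000,
    "h": 3_600_000,
    "d": 86_400_000,
}


def parse_time_ms(value: str) -> float:
    """Convert a time string such as '3s' or '0' to milliseconds (hypothetical helper)."""
    match = re.fullmatch(r"\s*(\d+)\s*([a-z]*)\s*", value.lower())
    if match is None or match.group(2) not in _UNIT_MS:
        raise ValueError(f"cannot parse time string: {value!r}")
    return int(match.group(1)) * _UNIT_MS[match.group(2)]


def should_fetch_block_locations(conf: dict) -> bool:
    """Return False when locality wait is 0, so block locations would go unused."""
    wait = conf.get("spark.locality.wait", "3s")
    return parse_time_ms(wait) > 0
```

With this sketch, a file index would call `should_fetch_block_locations(conf)` once before listing files and skip the `getFileBlockLocations` request whenever it returns `False`.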