[ 
https://issues.apache.org/jira/browse/SPARK-27232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-27232:
-------------------------------
    Description: 
`InMemoryFileIndex` requests file block location information so that 
`TaskSetManager` can do locality-aware scheduling. 

This is usually time-consuming. For example, in our production environment 
there are 24 partitions with 149925 files in total, 83TB in size. It takes 
about 10 minutes to request the file block locations before a Spark job is 
submitted. Even after I set 
`spark.sql.sources.parallelPartitionDiscovery.threshold` to 24 to parallelize 
the listing, it still takes 2 minutes. 

This is wasted work if we don't care about file locality (for example, when 
storage and computation are separate).

So there should be a conf to control whether we send `getFileBlockLocations` 
requests to the HDFS NameNode. If the user sets `spark.locality.wait` to 0, 
file block location information is meaningless. 

In this PR, if `spark.locality.wait` is set to 0, file location information 
is no longer requested, which saves several seconds to minutes.
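
The proposed behaviour can be illustrated with a minimal sketch. This is not 
the code from the PR and not Spark's actual implementation; it is a plain 
Python model of the decision, and `FileEntry`, `parse_wait_ms`, `list_files` 
and the `fetch_block_hosts` callback are all hypothetical names introduced 
here. The callback stands in for the `getFileBlockLocations` request, and the 
duration parser handles only a simplified subset of time strings.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class FileEntry:
    """Hypothetical stand-in for one listed file."""
    path: str
    size: int
    # None means block locations were never requested for this file.
    block_hosts: Optional[List[str]] = None


def parse_wait_ms(value: str) -> int:
    """Parse a simplified duration such as "0", "0s", "3s" or "500ms".

    Assumption: a bare number is taken as milliseconds.
    """
    text = value.strip().lower()
    # Check "ms" before "s", because "ms" also ends with "s".
    for suffix, factor in (("ms", 1), ("s", 1000)):
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    return int(text)


def list_files(
    statuses: List[Tuple[str, int]],
    locality_wait: str,
    fetch_block_hosts: Callable[[str], List[str]],
) -> List[FileEntry]:
    """Build file entries, skipping the location lookup when the wait is 0."""
    need_locations = parse_wait_ms(locality_wait) > 0
    entries = []
    for path, size in statuses:
        # The expensive per-file request happens only when locality matters.
        hosts = fetch_block_hosts(path) if need_locations else None
        entries.append(FileEntry(path, size, hosts))
    return entries
```

With a wait of "0" the callback is never invoked, so the listing cost no 
longer grows with one location request per file; with any positive wait the 
entries carry their hosts as before.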


> Ignore file locality in InMemoryFileIndex if spark.locality.wait is set to
> --------------------------------------------------------------------------
>
>                 Key: SPARK-27232
>                 URL: https://issues.apache.org/jira/browse/SPARK-27232
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: EdisonWang
>            Priority: Minor


