[ 
https://issues.apache.org/jira/browse/SPARK-29189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023340#comment-17023340
 ] 

Reynold Xin commented on SPARK-29189:
-------------------------------------

This is great, but how would users know when to set this? Shouldn't we make a 
small incremental improvement that automatically detects the common object 
stores and disables the locality check?
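
The auto-detection suggested here could be sketched roughly as follows. This is 
an illustration only, not Spark code: the function name, the `user_override` 
parameter, and the list of schemes are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Hypothetical set of URI schemes that typically point at object stores,
# where block locations give the scheduler nothing useful to work with.
# This list is an assumption for illustration, not taken from Spark.
OBJECT_STORE_SCHEMES = frozenset(
    {"s3", "s3a", "s3n", "gs", "wasb", "wasbs", "abfs", "abfss"}
)


def should_ignore_block_locations(path, user_override=None):
    """Decide whether file listing should skip fetching block locations.

    An explicit user setting always wins; otherwise fall back to a
    scheme-based guess about whether the path lives in an object store.
    """
    if user_override is not None:
        return user_override
    scheme = urlparse(path).scheme.lower()
    return scheme in OBJECT_STORE_SCHEMES


if __name__ == "__main__":
    for p in ("s3a://bucket/table/part-0000.parquet", "hdfs://nn:8020/warehouse/t"):
        print(p, should_ignore_block_locations(p))
```

A scheme check like this would presumably not catch a remote HDFS cluster such 
as the one described in the issue, so an explicit option would still be needed 
for that case.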

> Add an option to ignore block locations when listing file
> ---------------------------------------------------------
>
>                 Key: SPARK-29189
>                 URL: https://issues.apache.org/jira/browse/SPARK-29189
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Wang, Gang
>            Assignee: Wang, Gang
>            Priority: Major
>             Fix For: 3.0.0
>
>
> In our production environment we run a pure Spark cluster in which 
> computation is separated from the storage layer; I think this setup is fairly 
> common. In this deployment mode, data locality is never achievable. 
>  The Spark scheduler has configurations that reduce the time spent waiting 
> for data locality (e.g. "spark.locality.wait"). The problem is that, in the 
> file-listing phase, the location information for every file, including every 
> block inside each file, is still fetched from the distributed file system. 
> In a production environment, a table can be so large that fetching all of 
> this location information takes tens of seconds.
>  To improve this scenario, Spark needs to provide an option that ignores data 
> locality entirely: all we need in the file-listing phase is the file 
> locations, without any block location information.
>  
> We ran a benchmark in our production environment; after ignoring the block 
> locations, the file-listing time dropped for every table we tested.
> ||Table Size||Total File Number||Total Block Number||List File Duration (With Block Location)||List File Duration (Without Block Location)||
> |22.6 T|30000|120000|16.841s|1.730s|
> |28.8 T|42001|148964|10.099s|2.858s|
> |3.4 T|20000|20000|5.833s|4.881s|
>  
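
To make the benchmark easier to read, the speedup for each row can be worked 
out directly from the table. The sketch below only reuses the numbers quoted 
in the issue; the `summarize` helper and its field names are made up for the 
example.

```python
# Rows copied from the benchmark table in the issue description:
# (table size, files, blocks, seconds with block locations, seconds without)
ROWS = [
    ("22.6 T", 30000, 120000, 16.841, 1.730),
    ("28.8 T", 42001, 148964, 10.099, 2.858),
    ("3.4 T", 20000, 20000, 5.833, 4.881),
]


def summarize(rows):
    """Derive blocks per file, speedup, and time saved for each row."""
    out = []
    for size, files, blocks, with_loc, without_loc in rows:
        out.append(
            {
                "size": size,
                "blocks_per_file": blocks / files,
                "speedup": with_loc / without_loc,
                "seconds_saved": with_loc - without_loc,
            }
        )
    return out


if __name__ == "__main__":
    for r in summarize(ROWS):
        print(
            f"{r['size']:>7}  {r['blocks_per_file']:.2f} blocks/file  "
            f"{r['speedup']:.2f}x faster  {r['seconds_saved']:.3f}s saved"
        )
```

The speedups come out at roughly 9.7x, 3.5x, and 1.2x. In these three rows the 
gain is largest for the table with about four blocks per file and smallest for 
the table with one block per file, although three data points are too few to 
draw a firm conclusion.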




