[jira] [Commented] (SPARK-29189) Add an option to ignore block locations when listing file
[ https://issues.apache.org/jira/browse/SPARK-29189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023340#comment-17023340 ]

Reynold Xin commented on SPARK-29189:
-------------------------------------

This is great, but how would users know when to set this? Shouldn't we make a slight incremental improvement and just automatically detect the common object stores and disable the locality check for them?

> Add an option to ignore block locations when listing file
> ---------------------------------------------------------
>
>                 Key: SPARK-29189
>                 URL: https://issues.apache.org/jira/browse/SPARK-29189
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Wang, Gang
>            Assignee: Wang, Gang
>            Priority: Major
>             Fix For: 3.0.0
>
>
> In our production environment we run a pure Spark cluster (I think this is also fairly common) where computation is separated from the storage layer. In such a deployment, data locality is never achievable.
> Spark's scheduler already has configurations to reduce the wait time for data locality (e.g. "spark.locality.wait"). The problem is that in the file-listing phase, the location information for every file, and for every block inside each file, is fetched from the distributed file system. In practice, a table can be so huge that just fetching all this location information takes tens of seconds.
> To improve this scenario, Spark should provide an option to ignore data locality entirely: all we need in the file-listing phase are the file locations, without any block location information.
>
> We ran a benchmark in our production environment; after ignoring the block locations, we saw a substantial improvement.
> ||Table Size||Total File Number||Total Block Number||List File Duration (With Block Location)||List File Duration (Without Block Location)||
> |22.6 T|3|12|16.841s|1.730s|
> |28.8 T|42001|148964|10.099s|2.858s|
> |3.4 T|2|2|5.833s|4.881s|

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
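For readers following the proposal, a minimal sketch of how a job might be submitted with locality effectively disabled. "spark.locality.wait" is the existing scheduler config mentioned in the description; the "spark.sql.sources.ignoreDataLocality" key below is only illustrative of the kind of option this ticket proposes and should be verified against the merged change:

```shell
# Sketch only, not a definitive invocation:
# - spark.locality.wait is an existing Spark scheduler config (setting it
#   to 0s stops the scheduler from waiting for local executors).
# - spark.sql.sources.ignoreDataLocality is a hypothetical name for the
#   option proposed in this ticket; check the merged PR for the real key.
# - com.example.MyJob and my-job.jar are placeholders.
spark-submit \
  --conf spark.locality.wait=0s \
  --conf spark.sql.sources.ignoreDataLocality=true \
  --class com.example.MyJob \
  my-job.jar
```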
[jira] [Commented] (SPARK-29189) Add an option to ignore block locations when listing file
[ https://issues.apache.org/jira/browse/SPARK-29189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946176#comment-16946176 ]

Imran Rashid commented on SPARK-29189:
--------------------------------------

Fixed by PR https://github.com/apache/spark/pull/25869