wangshisan opened a new pull request #25869: [SPARK-29189] Add an option to 
ignore block locations when listing file
URL: https://github.com/apache/spark/pull/25869
 
 
   
   ### What changes were proposed in this pull request?
   In our PROD environment we run a pure Spark cluster in which computation is 
separated from the storage layer; I think this setup is pretty common. In such a 
deployment, data locality is never achievable.
   The Spark scheduler has some configurations that reduce the time spent 
waiting for data locality (e.g. "spark.locality.wait"). The problem is that, in 
the file listing phase, the location information of all the files, including 
all the blocks inside each file, is still fetched from the distributed file 
system. In a PROD environment, a table can be so huge that fetching all this 
location information alone takes tens of seconds.
   To improve this scenario, Spark needs to provide an option under which data 
locality is ignored entirely: all we need in the file listing phase is the file 
locations, without any block location information.
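
   To illustrate the idea, here is a minimal toy model in Python. It is not the Spark or Hadoop implementation: `ToyFileSystem`, `FileEntry`, `BlockLocation`, and the `ignore_block_locations` flag are hypothetical names made up for this sketch, and the block size is arbitrary. It only shows that a listing which skips block locations avoids the per-block metadata lookups.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class BlockLocation:
    offset: int
    length: int
    hosts: List[str]


@dataclass
class FileEntry:
    path: str
    size: int
    # None means block locations were deliberately not fetched.
    blocks: Optional[List[BlockLocation]] = None


class ToyFileSystem:
    """Hypothetical stand-in for a distributed file system client."""

    def __init__(self, files: Dict[str, int], block_size: int = 128):
        self._files = files            # path -> file size
        self._block_size = block_size  # arbitrary size for the toy model
        self.block_lookups = 0         # counts simulated per-block fetches

    def _locate_blocks(self, size: int) -> List[BlockLocation]:
        blocks = []
        for offset in range(0, size, self._block_size):
            self.block_lookups += 1    # one simulated metadata lookup per block
            length = min(self._block_size, size - offset)
            blocks.append(BlockLocation(offset, length, ["storage-host"]))
        return blocks

    def list_files(self, ignore_block_locations: bool) -> List[FileEntry]:
        entries = []
        for path, size in sorted(self._files.items()):
            # With the flag set, only path and size are returned.
            blocks = None if ignore_block_locations else self._locate_blocks(size)
            entries.append(FileEntry(path, size, blocks))
        return entries
```

   In this model, `list_files(ignore_block_locations=True)` returns the same paths and sizes while leaving `block_lookups` at zero, which is the saving the proposed option is after when locality cannot be used anyway.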
   
   
   ### Why are the changes needed?
   We ran a benchmark in our PROD environment: after ignoring the block 
locations, we got a pretty huge improvement. See 
https://issues.apache.org/jira/browse/SPARK-29189
   
   ### Does this PR introduce any user-facing change?
   No.
   
   
   ### How was this patch tested?
   In our PROD environment.
   

