fuwhu commented on a change in pull request #27129: [SPARK-30427][SQL] Add
config item for limiting partition number when calculating statistics through
File System
URL: https://github.com/apache/spark/pull/27129#discussion_r373307935
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala
##########
@@ -59,6 +60,21 @@ private[sql] object PruneFileSourcePartitions extends
Rule[LogicalPlan] {
Project(projects, withFilter)
}
+ private def getNewStatistics(
+ fileIndex: InMemoryFileIndex,
Review comment:
> @fuwhu `InMemoryFileIndex` caches all the file status on construction. Is
it true that statistic calculation is very expensive?
Yes, you are right. The `refresh0` method calls `listLeafFiles` to fetch all
file statuses, and that already lists in parallel once the number of paths
exceeds the threshold.
So the statistics calculation here is really just a sum of the lengths of
all the leaf files. I will remove the conf check here.
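To make the point concrete, here is a pure-Scala sketch of that "just a sum" step. `LeafFile` is a hypothetical stand-in for Hadoop's `FileStatus` (only the length matters); the real code reads the lengths from the listing that `InMemoryFileIndex` already caches, so no extra file-system calls are needed:

```scala
// Hypothetical stand-in for Hadoop's FileStatus; only the length is relevant.
case class LeafFile(path: String, length: Long)

// Total size in bytes, derived from the cached listing: a plain sum,
// no additional round trips to the file system.
def totalSizeInBytes(leafFiles: Seq[LeafFile]): Long =
  leafFiles.map(_.length).sum

val files = Seq(LeafFile("part-00000", 128L), LeafFile("part-00001", 256L))
// totalSizeInBytes(files) == 384L
```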
But for Hive tables, there is currently no place to get the file/directory
status directly without accessing the file system. So I think it is still
necessary to limit the number of partitions used for the statistics
calculation. WDYT?
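Roughly what I mean by limiting, as a hedged sketch: when the partition count exceeds some configured threshold, skip the file-system-based size calculation and fall back to a default size. Both `maxPartitionsForStats` and `Partition` below are hypothetical names, not the actual config key or class proposed in this PR:

```scala
// Hypothetical threshold; the real value would come from a SQLConf entry.
val maxPartitionsForStats = 1000

// Hypothetical partition model: name plus an optional known size in bytes.
case class Partition(name: String, sizeInBytes: Option[Long])

// If listing every partition would be too expensive, return the default
// (e.g. spark.sql.defaultSizeInBytes) instead of touching the file system.
def statsOrDefault(partitions: Seq[Partition], defaultSize: Long): Long =
  if (partitions.size > maxPartitionsForStats) defaultSize
  else partitions.flatMap(_.sizeInBytes).sum
```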
As for parallel statistics calculation, I am not sure whether it is
worthwhile to launch a distributed job for it or just use multiple threads.
The benefit of having the statistics may not cover the cost of computing
them. WDYT?
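For the thread-based option, something like the sketch below is what I have in mind: fan out one future per partition and sum the results. `partitionSize` here is a hypothetical stand-in for the per-partition file-system lookup:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical per-partition size lookup; in reality this would hit the
// file system for that partition's directory.
def partitionSize(partition: String): Long = partition.length.toLong * 100

// Thread-based alternative to a distributed job: compute each partition's
// size concurrently, then sum the results on the driver.
def parallelTotalSize(partitions: Seq[String]): Long = {
  val futures = partitions.map(p => Future(partitionSize(p)))
  Await.result(Future.sequence(futures), 1.minute).sum
}
```

Whether this beats a distributed job depends on how many partitions there are and how slow the file-system metadata calls are; for a small partition count the thread pool avoids job-scheduling overhead entirely.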