fuwhu commented on a change in pull request #27129: [SPARK-30427][SQL] Add
config item for limiting partition number when calculating statistics through
File System
URL: https://github.com/apache/spark/pull/27129#discussion_r373307935
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala
##########
@@ -59,6 +60,21 @@ private[sql] object PruneFileSourcePartitions extends
Rule[LogicalPlan] {
Project(projects, withFilter)
}
+ private def getNewStatistics(
+ fileIndex: InMemoryFileIndex,
Review comment:
> @fuwhu `InMemoryFileIndex` caches all the file status on construction. Is
it true that statistic calculation is very expensive?
Yes, you are right. The `refresh0` method calls `listLeafFiles` to fetch all
file statuses, and that already lists in parallel once the number of paths
exceeds the threshold.
So the statistics calculation here is really just a sum of the lengths of
all the leaf files. I will remove the conf check here.
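To make the point concrete, here is a pure-Scala sketch of that "just a sum" step. `LeafFile` is a hypothetical stand-in for Hadoop's `FileStatus` (only the length matters); the real code reads the lengths from the listing that `InMemoryFileIndex` already caches, so no extra file-system calls are needed:

```scala
// Hypothetical stand-in for Hadoop's FileStatus; only the length is relevant.
case class LeafFile(path: String, length: Long)

// Total size in bytes, derived from the cached listing: a plain sum,
// no additional round trips to the file system.
def totalSizeInBytes(leafFiles: Seq[LeafFile]): Long =
  leafFiles.map(_.length).sum

val files = Seq(LeafFile("part-00000", 128L), LeafFile("part-00001", 256L))
// totalSizeInBytes(files) == 384L
```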
But for Hive tables, there is currently no place to get the file/directory
status directly without accessing the file system. So I think it is still
necessary to limit the number of partitions used for the statistics
calculation. WDYT?
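Roughly what I mean by limiting, as a hedged sketch: when the partition count exceeds some configured threshold, skip the file-system-based size calculation and fall back to a default size. Both `maxPartitionsForStats` and `Partition` below are hypothetical names, not the actual config key or class proposed in this PR:

```scala
// Hypothetical threshold; the real value would come from a SQLConf entry.
val maxPartitionsForStats = 1000

// Hypothetical partition model: name plus an optional known size in bytes.
case class Partition(name: String, sizeInBytes: Option[Long])

// If listing every partition would be too expensive, return the default
// (e.g. spark.sql.defaultSizeInBytes) instead of touching the file system.
def statsOrDefault(partitions: Seq[Partition], defaultSize: Long): Long =
  if (partitions.size > maxPartitionsForStats) defaultSize
  else partitions.flatMap(_.sizeInBytes).sum
```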
As for parallel statistics calculation, I am not sure whether it is
worthwhile to launch a distributed job for it or just use multiple threads.
The benefit of having the statistics may not cover the cost of computing
them. WDYT?
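For the thread-based option, something like the sketch below is what I have in mind: fan out one future per partition and sum the results. `partitionSize` here is a hypothetical stand-in for the per-partition file-system lookup:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical per-partition size lookup; in reality this would hit the
// file system for that partition's directory.
def partitionSize(partition: String): Long = partition.length.toLong * 100

// Thread-based alternative to a distributed job: compute each partition's
// size concurrently, then sum the results on the driver.
def parallelTotalSize(partitions: Seq[String]): Long = {
  val futures = partitions.map(p => Future(partitionSize(p)))
  Await.result(Future.sequence(futures), 1.minute).sum
}
```

Whether this beats a distributed job depends on how many partitions there are and how slow the file-system metadata calls are; for a small partition count the thread pool avoids job-scheduling overhead entirely.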