fuwhu commented on a change in pull request #27213: [SPARK-30516][SQL]
statistic estimation of FileScan should take partitionFilters and partition
number into account
URL: https://github.com/apache/spark/pull/27213#discussion_r367290358
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -1392,6 +1392,16 @@ object SQLConf {
.booleanConf
.createWithDefault(false)
+ val MAX_PARTITION_NUMBER_FOR_STATS_CALCULATION_VIA_FS =
+ buildConf("spark.sql.statistics.statisticViaFileSystem.maxPartitionNumber")
+ .doc("If the number of table (can be either hive table or data source
table ) partitions " +
+ "exceed this value, statistic calculation via file system is not
allowed. This is to " +
+ "avoid calculating size of large number of partitions via file system,
eg. HDFS, which " +
+ "is very time consuming. Setting this value to negative will disable
statistic " +
+ "calculation via file system.")
+ .intConf
+ .createWithDefault(1000)
+
Review comment:
This config is proposed in #27129 , will resolve the conflict after #27129
finished.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]