fuwhu commented on a change in pull request #27213: [SPARK-30516][SQL] 
statistic estimation of FileScan should take partitionFilters and partition 
number into account
URL: https://github.com/apache/spark/pull/27213#discussion_r367290358
 
 

 ##########
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
 ##########
 @@ -1392,6 +1392,16 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
 
+  val MAX_PARTITION_NUMBER_FOR_STATS_CALCULATION_VIA_FS =
+    buildConf("spark.sql.statistics.statisticViaFileSystem.maxPartitionNumber")
+      .doc("If the number of table (can be either hive table or data source 
table ) partitions " +
+        "exceed this value, statistic calculation via file system is not 
allowed. This is to " +
+        "avoid calculating size of large number of partitions via file system, 
eg. HDFS, which " +
+        "is very time consuming. Setting this value to negative will disable 
statistic " +
+        "calculation via file system.")
+      .intConf
+      .createWithDefault(1000)
+
 
 Review comment:
   This config is proposed in #27129 , will resolve the conflict after #27129 
finished.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to