wangyum commented on issue #22502: [SPARK-25474][SQL] When "fallBackToHdfsForStats=true", size in bytes is coming as the default size in bytes (8.0 EB)
URL: https://github.com/apache/spark/pull/22502#issuecomment-506421960

I think the correct approach is to add a new rule (#24715) if the issue occurs at the table level. Actually, I have a long-term plan:

1. Data source tables support falling back to HDFS for size estimation (#24715)
2. Remove the duplicated table-size calculation logic (#24712)
3. Persist the table statistics to the metastore after falling back to HDFS (#24551)
4. Refactor `DetermineTableStats` to invalidate the cache when certain configurations change (#22743)

For example, after #24715:

```scala
[root@spark-3267648 spark]# bin/spark-shell --conf spark.sql.statistics.fallBackToHdfs=true
Spark context Web UI available at http://spark-3267648.lvs02.dev.ebayc3.com:4040
Spark context available as 'sc' (master = local[*], app id = local-1561652081851).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
      /_/

Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_211)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("create table table1 (id int, name string) using parquet partitioned by (name)")
res0: org.apache.spark.sql.DataFrame = []

scala> spark.sql("insert into table1 values (1, 'a')")
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("explain cost select * from table1").show(false)
== Optimized Logical Plan ==
Relation[id#2,name#3] parquet, Statistics(sizeInBytes=421.0 B)

== Physical Plan ==
*(1) FileScan parquet default.table1[id#2,name#3] Batched: true, DataFilters: [], Format: Parquet, Location: CatalogFileIndex[file:/root/opensource/spark/spark-warehouse/table1], PartitionCount: 1, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
```
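The transcript above shows the point of the plan: with `spark.sql.statistics.fallBackToHdfs=true`, the optimizer reports the real on-disk size (421.0 B) instead of `spark.sql.defaultSizeInBytes`, whose default of `Long.MaxValue` is what renders as 8.0 EB. A minimal sketch of that fallback order follows; the function name, signature, and use of a local `java.io.File` walk in place of an HDFS `getContentSummary` call are all hypothetical illustrations, not Spark's actual API:

```scala
import java.io.File

// Sketch of the size-estimation fallback (hypothetical, not Spark's real code):
// 1. prefer statistics recorded in the catalog/metastore,
// 2. else, if the fallback flag is on, sum file sizes under the table location,
// 3. else, use the default size (Long.MaxValue bytes, i.e. the 8.0 EB above).
def estimateSizeInBytes(
    catalogSize: Option[Long],         // size recorded in the metastore, if any
    tablePath: Option[File],           // table location, for the filesystem fallback
    fallBackToFs: Boolean,             // models spark.sql.statistics.fallBackToHdfs
    defaultSize: Long = Long.MaxValue  // models spark.sql.defaultSizeInBytes
): Long = {
  // Recursively sum file lengths; listFiles() can return null, hence Option.
  def dirSize(f: File): Long =
    if (f.isFile) f.length()
    else Option(f.listFiles()).map(_.map(dirSize).sum).getOrElse(0L)

  catalogSize match {
    case Some(s)              => s                                       // catalog wins
    case None if fallBackToFs => tablePath.map(dirSize).getOrElse(defaultSize)
    case None                 => defaultSize                             // 8.0 EB case
  }
}
```

Persisting the computed size back to the metastore (step 3 of the plan, #24551) would make the filesystem walk a one-time cost instead of something paid on every planning pass.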
---------------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: [email protected]

With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
