wangyum commented on issue #22502: [SPARK-25474][SQL] When "fallBackToHdfsForStats=true", size in bytes is coming as the default size in bytes (8.0 EB)
URL: https://github.com/apache/spark/pull/22502#issuecomment-506421960

I think the correct approach is to add a new rule (#24715) if the issue occurs at the table level. Actually, I have a long-term plan:

1. Data source tables support falling back to HDFS for size estimation (#24715)
2. Remove the duplicated table-size calculation logic (#24712)
3. Persist the table statistics to the metastore after falling back to HDFS (#24551)
4. Refactor `DetermineTableStats` to invalidate the cache when certain configurations change (#22743)

For example, after #24715:

```scala
[root@spark-3267648 spark]# bin/spark-shell --conf spark.sql.statistics.fallBackToHdfs=true
Spark context Web UI available at http://spark-3267648.lvs02.dev.ebayc3.com:4040
Spark context available as 'sc' (master = local[*], app id = local-1561652081851).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
      /_/

Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_211)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("create table table1 (id int, name string) using parquet partitioned by (name)")
res0: org.apache.spark.sql.DataFrame = []

scala> spark.sql("insert into table1 values (1, 'a')")
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("explain cost select * from table1").show(false)
== Optimized Logical Plan ==
Relation[id#2,name#3] parquet, Statistics(sizeInBytes=421.0 B)

== Physical Plan ==
*(1) FileScan parquet default.table1[id#2,name#3] Batched: true, DataFilters: [], Format: Parquet, Location: CatalogFileIndex[file:/root/opensource/spark/spark-warehouse/table1], PartitionCount: 1, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
```
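The transcript above shows the point of the plan: with `spark.sql.statistics.fallBackToHdfs=true`, the optimizer reports the real on-disk size (421.0 B) instead of `spark.sql.defaultSizeInBytes`, whose default of `Long.MaxValue` is what renders as 8.0 EB. A minimal sketch of that fallback order follows; the function name, signature, and use of a local `java.io.File` walk in place of an HDFS `getContentSummary` call are all hypothetical illustrations, not Spark's actual API:

```scala
import java.io.File

// Sketch of the size-estimation fallback (hypothetical, not Spark's real code):
// 1. prefer statistics recorded in the catalog/metastore,
// 2. else, if the fallback flag is on, sum file sizes under the table location,
// 3. else, use the default size (Long.MaxValue bytes, i.e. the 8.0 EB above).
def estimateSizeInBytes(
    catalogSize: Option[Long],         // size recorded in the metastore, if any
    tablePath: Option[File],           // table location, for the filesystem fallback
    fallBackToFs: Boolean,             // models spark.sql.statistics.fallBackToHdfs
    defaultSize: Long = Long.MaxValue  // models spark.sql.defaultSizeInBytes
): Long = {
  // Recursively sum file lengths; listFiles() can return null, hence Option.
  def dirSize(f: File): Long =
    if (f.isFile) f.length()
    else Option(f.listFiles()).map(_.map(dirSize).sum).getOrElse(0L)

  catalogSize match {
    case Some(s)              => s                                       // catalog wins
    case None if fallBackToFs => tablePath.map(dirSize).getOrElse(defaultSize)
    case None                 => defaultSize                             // 8.0 EB case
  }
}
```

Persisting the computed size back to the metastore (step 3 of the plan, #24551) would make the filesystem walk a one-time cost instead of something paid on every planning pass.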
---------------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: [email protected]

With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
