karuppayya commented on pull request #28686:
URL: https://github.com/apache/spark/pull/28686#issuecomment-652613180


   @viirya @maropu 
   I re-examined the code: the stats from a Hive table relation are propagated only for [partitioned tables](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L200). Also, `DetermineTableStats` computes the stats only for [non-partitioned tables](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L121). I think for a partitioned table the size estimate will also end up being `spark.sql.defaultSizeInBytes`.
   In the case of a non-partitioned table, the `HadoopFsRelation` that gets created uses an `InMemoryFileIndex`, which does not use the computed stats and performs its own file listing to derive the size.
   Let me know if I am missing something here.
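   
   To make this concrete, here is a minimal spark-shell sketch (the table name `store_sales` is only a placeholder for a non-partitioned Hive table that gets converted to a `HadoopFsRelation`); the size the optimizer sees comes from the file index, not from the stats computed earlier:
   ```scala
   import org.apache.spark.sql.execution.datasources.LogicalRelation

   // Read a non-partitioned Hive table that is converted to a HadoopFsRelation.
   val df = spark.table("store_sales")

   // For a HadoopFsRelation the reported sizeInBytes is derived from the file
   // listing done by InMemoryFileIndex, so the stats computed by
   // DetermineTableStats are not what the optimizer ends up using.
   df.queryExecution.optimizedPlan.collect {
     case rel: LogicalRelation =>
       println(s"relation sizeInBytes:   ${rel.relation.sizeInBytes}")
       println(s"plan stats sizeInBytes: ${rel.stats.sizeInBytes}")
   }
   ```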
   
   To add to how this change is useful, I took the q17.sql TPCDS query at scale factor 1000 on non-partitioned data.
   Without this change, these are the planning-phase metrics for the query:
   ```
   scala> val df = sql(query)
   scala> df.queryExecution.tracker.topRulesByTime(2).foreach(println)
   (org.apache.spark.sql.hive.DetermineTableStats,RuleSummary(55677175448, 3, 3))
   (org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions,RuleSummary(37485411305, 6, 0))
   ```
   The first field of `RuleSummary` is the total time in nanoseconds, so `DetermineTableStats` alone accounts for roughly 55 seconds here; the time is also inflated by SPARK-31850.
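   
   The tracker also records per-phase wall-clock durations, which is another way to see how much of the planning time this accounts for (a small sketch using the same `QueryPlanningTracker` API on the `df` from above):
   ```scala
   // Print the duration of each query planning phase (analysis, optimization, ...)
   // recorded by the tracker for the query above.
   df.queryExecution.tracker.phases.foreach { case (phase, summary) =>
     println(s"$phase: ${summary.durationMs} ms")
   }
   ```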
   
   The stats computed here are not used, so the computation can be avoided completely.
   Let me know your thoughts.
   

