liupc commented on a change in pull request #27055: [SPARK-30394] Skip DetermineTableStats rule when hive table can be converted to datasource table
URL: https://github.com/apache/spark/pull/27055#discussion_r375872617
 
 

 ##########
 File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala
 ##########
 @@ -139,13 +139,15 @@ class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] {
 
   override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
     case relation: HiveTableRelation
-      if DDLUtils.isHiveTable(relation.tableMeta) && relation.tableMeta.stats.isEmpty =>
+      if DDLUtils.isHiveTable(relation.tableMeta) && relation.tableMeta.stats.isEmpty &&
+        !RelationConversions.isConvertible(relation) =>
 Review comment:
   @viirya I agree that doing size estimation on demand and disregarding catalog statistics is expensive. What I actually want to do is skip the `fallBackToHdfs` code path for datasource tables and do the size estimation in `HadoopFsRelation` instead.
   Maybe we should also add a config similar to `fallBackToHdfs` in `HadoopFsRelation`? We would only do real scans when that config is true; otherwise we would use the stats in `CatalogTable` to compute `sizeInBytes`.
   The reason I think we should move this size estimation into `HadoopFsRelation` for datasource tables is that we can do better estimation there for formats like Parquet; see [SPARK-30712](https://issues.apache.org/jira/browse/SPARK-30712).
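   To illustrate the proposed behavior, here is a minimal, self-contained Scala sketch of the decision logic: prefer catalog statistics, fall back to a real file-system scan only when a config allows it, and otherwise use a conservative default. The names (`SizeEstimator`, `fallBackToScan`, `scanFiles`) are hypothetical and simplified; this is not Spark's actual API.

```scala
// Hypothetical sketch of the proposed sizeInBytes estimation.
// All names here are illustrative, not Spark internals.
case class CatalogStats(sizeInBytes: Option[Long])

class SizeEstimator(fallBackToScan: Boolean, defaultSize: Long) {
  // In a real implementation this would sum file lengths on HDFS;
  // stubbed out here to keep the sketch self-contained.
  def scanFiles(): Long = 0L

  def sizeInBytes(stats: CatalogStats): Long =
    stats.sizeInBytes match {
      case Some(s)                => s            // trust catalog statistics when present
      case None if fallBackToScan => scanFiles()  // expensive real scan, gated by config
      case None                   => defaultSize  // conservative default, no scan
    }
}
```

   The point of the sketch is that the expensive scan only happens when statistics are missing and the config opts in, which matches the intent of skipping `fallBackToHdfs` for convertible tables.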
