liupc commented on a change in pull request #27055: [SPARK-30394] Skip DetermineTableStats rule when hive table can be converted to datasource table
URL: https://github.com/apache/spark/pull/27055#discussion_r375872617
##########
File path:
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala
##########
@@ -139,13 +139,15 @@ class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
     case relation: HiveTableRelation
-      if DDLUtils.isHiveTable(relation.tableMeta) && relation.tableMeta.stats.isEmpty =>
+      if DDLUtils.isHiveTable(relation.tableMeta) && relation.tableMeta.stats.isEmpty &&
+        !RelationConversions.isConvertible(relation) =>
Review comment:
@viirya I agree that doing size estimation on demand while disregarding catalog statistics is expensive. What I actually want to do is skip the `fallBackToHdfs` code path for datasource tables and do the size estimation in `HadoopFsRelation` instead.
Maybe we should also add a config similar to `fallBackToHdfs` to `HadoopFsRelation`: when the config is true, we do real file scans; otherwise, we just use the stats in `CatalogTable` to compute `sizeInBytes`.
The reason I think we should move this size estimation into `HadoopFsRelation` for datasource tables is that we can do better estimation for formats like Parquet; see [SPARK-30712](https://issues.apache.org/jira/browse/SPARK-30712).