[
https://issues.apache.org/jira/browse/SPARK-30394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-30394:
------------------------------------
Assignee: (was: Apache Spark)
> Skip collecting stats in DetermineTableStats rule when hive table is
> convertible to datasource tables
> ------------------------------------------------------------------------------------------------------
>
> Key: SPARK-30394
> URL: https://issues.apache.org/jira/browse/SPARK-30394
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: liupengcheng
> Priority: Major
>
> Currently, if `spark.sql.statistics.fallBackToHdfs` is enabled, then spark
> will scan hdfs files to collect table stats in `DetermineTableStats` rule.
> But this can be expensive and not accurate(only file size on disk, not
> accounting compression factor), acutually we can skip this if this hive table
> can be converted to datasource table(parquet etc.), and do better estimation
> in `HadoopFsRelation`.
> BeforeSPARK-28573, the implementation will update the CatalogTableStatistics,
> which will cause the improper stats(for parquet, this size is greatly smaller
> than real size in memory) be used in joinSelection when the hive table can be
> convert to datasource table.
> In our production environment, user's highly compressed parquet table can
> cause OOMs when doing `broadcastHashJoin` due to this improper stats.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]