[ 
https://issues.apache.org/jira/browse/SPARK-30394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30394:
------------------------------------

    Assignee:     (was: Apache Spark)

> Skip collecting stats in DetermineTableStats rule when hive table is 
> convertible to  datasource tables
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-30394
>                 URL: https://issues.apache.org/jira/browse/SPARK-30394
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: liupengcheng
>            Priority: Major
>
> Currently, if `spark.sql.statistics.fallBackToHdfs` is enabled, then spark 
> will scan hdfs files to collect table stats in `DetermineTableStats` rule. 
> But this can be expensive and not accurate(only file size on disk, not 
> accounting compression factor), acutually we can skip this if this hive table 
> can be converted to datasource table(parquet etc.), and do better estimation 
> in `HadoopFsRelation`.
> BeforeSPARK-28573, the implementation will update the CatalogTableStatistics, 
> which will cause the improper stats(for parquet, this size is greatly smaller 
> than real size in memory) be used in joinSelection when the hive table can be 
> convert to datasource table.
> In our production environment, user's highly compressed parquet table can 
> cause OOMs when doing `broadcastHashJoin` due to this improper stats.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to