[
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
George Pachitariu updated HIVE-20523:
-------------------------------------
Summary: Improve table statistics for Parquet format (was: Improve table
statistics when the table contains arrays)
> Improve table statistics for Parquet format
> -------------------------------------------
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-20523.patch
>
>
> By default, when table statistics are available, the value of *rawDataSize*
> is used to estimate the table data size in the execution plan.
> The problem is that rawDataSize does not include the data size of arrays,
> so the table size is underestimated when arrays make up most of the
> table.
> In those specific cases, the value of *totalSize* is much closer to the
> truth.
> In this task I propose taking the *max* of *rawDataSize* and
> *totalSize* * *deserializationFactor*.
> I don't know whether this proposal will backfire in specific cases
> (by overestimating the size of tables).
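
The proposed estimate can be sketched as follows. This is only an illustration of the rule described above, not the code in HIVE-20523.patch: the class name TableSizeEstimate, the method name estimateDataSize, and the parameter types are all hypothetical, and the sketch assumes the three values are already available as plain numbers.

```java
// Hypothetical helper illustrating the proposal; not taken from HIVE-20523.patch.
final class TableSizeEstimate {

    private TableSizeEstimate() {
        // static helper only
    }

    /**
     * Returns the larger of rawDataSize and totalSize scaled by the
     * deserialization factor.
     *
     * @param rawDataSize           rawDataSize from the table statistics, in bytes
     * @param totalSize             totalSize from the table statistics, in bytes
     * @param deserializationFactor multiplier applied to totalSize (assumed to be
     *                              supplied by the caller)
     */
    static long estimateDataSize(long rawDataSize, long totalSize,
                                 double deserializationFactor) {
        // Scale the on-disk size. A double-to-long cast in Java saturates at
        // Long.MAX_VALUE, so a very large product does not wrap around.
        long scaledTotalSize = (long) (totalSize * deserializationFactor);

        // Take whichever estimate is larger.
        return Math.max(rawDataSize, scaledTotalSize);
    }
}
```

With made-up numbers, rawDataSize = 100, totalSize = 1000 and a factor of 10.0 give an estimate of 10000, while a table whose rawDataSize is already 50000 keeps 50000. The second case shows that the rule never lowers an existing estimate; it can only raise it, which is where the overestimation concern comes from.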
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)