[
https://issues.apache.org/jira/browse/IMPALA-6632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alex Rodoni closed IMPALA-6632.
-------------------------------
Resolution: Won't Fix
> Document compatibility of table and column stats between Impala and Hive
> ------------------------------------------------------------------------
>
> Key: IMPALA-6632
> URL: https://issues.apache.org/jira/browse/IMPALA-6632
> Project: IMPALA
> Issue Type: Improvement
> Components: Docs
> Reporter: Alexander Behm
> Assignee: Alex Rodoni
> Priority: Major
>
> The question of compatibility between the table and column stats between Hive
> and Impala comes up quite often, so is worth documenting explicitly.
> Quoting myself from a recent discussion thread to get the docs effort started:
> Commonalities:
> - Hive and Impala both store row counts at the table level and partition
> level. Hive also computes and stores additional stats like file counts which
> Impala does not need or use.
> Differences:
> - Impala computes and stores column-level stats like the number of distinct
> values (NDV) only at the table level, and not at the partition level.
> - Hive computes and stores column-level stats at the partition level. Impala
> does not follow this approach because the per-partition NDVs cannot be
> sensibly combined for queries that access multiple partitions. In short, the
> column stats for partitioned tables are not compatible between Impala and
> Hive (because imo Hive's approach does not make sense).
> - Impala uses a more modern and tuned algorithm (HyperLogLog++) for
> estimating the number of distinct values, so they tend to be more accurate
> than Hive's. Your mileage may vary.
> - For unpartitioned tables, the Hive and Impala column stats are compatible.
> For partitioned tables, the table-level column stats that Impala writes in
> the Metastore are stored just like for unpartitioned tables. These statistics
> are "available" to Hive in the sense that the standard retrieval APIs will
> work as expected. My understanding is that for partitioned tables, Hive does
> not use the table-level column stats, but instead expects partition-level
> column stats. As I've said before, these partition-level column stats do not
> make any sense because it is not possible to sensibly combine them for
> multiple partitions.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)