[ https://issues.apache.org/jira/browse/IMPALA-6632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Rodoni closed IMPALA-6632.
-------------------------------
    Resolution: Won't Fix

> Document compatibility of table and column stats between Impala and Hive
> ------------------------------------------------------------------------
>
>                 Key: IMPALA-6632
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6632
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Docs
>            Reporter: Alexander Behm
>            Assignee: Alex Rodoni
>            Priority: Major
>
> The question of whether table and column stats are compatible between Hive 
> and Impala comes up quite often, so it is worth documenting explicitly.
> Quoting myself from a recent discussion thread to get the docs effort started:
> Commonalities:
> - Hive and Impala both store row counts at the table level and partition 
> level. Hive also computes and stores additional stats, such as file counts, 
> which Impala does not need or use.
> Differences:
> - Impala computes and stores column-level stats like the number of distinct 
> values (NDV) only at the table level, and not at the partition level.
> - Hive computes and stores column-level stats at the partition level. Impala 
> does not follow this approach because the per-partition NDVs cannot be 
> sensibly combined for queries that access multiple partitions. In short, the 
> column stats for partitioned tables are not compatible between Impala and 
> Hive (because imo Hive's approach does not make sense).
> - Impala uses a more modern and tuned algorithm (HyperLogLog++) for 
> estimating the number of distinct values, so its NDV estimates tend to be 
> more accurate than Hive's. Your mileage may vary.
> - For unpartitioned tables, the Hive and Impala column stats are compatible.
> For partitioned tables, the table-level column stats that Impala writes in 
> the Metastore are stored just like for unpartitioned tables. These statistics 
> are "available" to Hive in the sense that the standard retrieval APIs will 
> work as expected.  My understanding is that for partitioned tables, Hive does 
> not use the table-level column stats, but instead expects partition-level 
> column stats. As I've said before, these partition-level column stats do not 
> make any sense because it is not possible to sensibly combine them for 
> multiple partitions.
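
The point about combining per-partition NDVs can be illustrated with a small sketch. It is not Impala or Hive code: the data, the exact-count `ndv` stand-in, and the helper names are all hypothetical, chosen only to show that two tables with identical per-partition NDVs can have very different table-level NDVs.

```python
# Illustrative only: exact distinct counts stand in for NDV estimates,
# and the partition contents below are made-up sample data.

def ndv(values):
    """Exact number of distinct values in a list of column values."""
    return len(set(values))

def combine_bounds(per_partition_ndvs):
    """Best you can say from per-partition NDV counts alone:
    the true NDV lies between the max and the sum."""
    return max(per_partition_ndvs), sum(per_partition_ndvs)

def summarize(partitions):
    """Return (per-partition NDVs, true NDV across all partitions)."""
    per_partition = [ndv(p) for p in partitions]
    all_rows = [v for p in partitions for v in p]
    return per_partition, ndv(all_rows)

# Two hypothetical tables, each with two partitions of one column.
overlapping = [["us", "de", "fr"], ["us", "de", "fr"]]
disjoint = [["us", "de", "fr"], ["jp", "cn", "in"]]

overlapping_stats, overlapping_true = summarize(overlapping)
disjoint_stats, disjoint_true = summarize(disjoint)

print("per-partition NDVs:", overlapping_stats, disjoint_stats)
print("true NDVs:", overlapping_true, disjoint_true)
print("bounds from per-partition NDVs:", combine_bounds(overlapping_stats))
```

Both tables report the same per-partition NDVs, yet their true NDVs differ by a factor of two, so a query spanning both partitions can only be given a range, not a single estimate, from the stored per-partition counts.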



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)