[
https://issues.apache.org/jira/browse/IMPALA-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Armstrong resolved IMPALA-3976.
-----------------------------------
Resolution: Won't Fix
It looks like this is finally being fixed in Hive, so I don't think we
necessarily want to invest effort in trying to work around it in Impala.
> Handle partition-key values with multiple synonymous string representations
> created in Hive.
> --------------------------------------------------------------------------------------------
>
> Key: IMPALA-3976
> URL: https://issues.apache.org/jira/browse/IMPALA-3976
> Project: IMPALA
> Issue Type: Bug
> Components: Catalog
> Affects Versions: Impala 2.3.0, Impala 2.5.0, Impala 2.4.0, Impala 2.6.0,
> Impala 2.7.0
> Reporter: Alexander Behm
> Priority: Major
> Labels: correctness, incompatibility
>
> For several SQL statements that can create new partitions, Hive seems to
> generate partition-key values and the corresponding HDFS directory based on
> the user's string input rather than the corresponding literal value of the
> appropriate column type. This leads to a situation where a single logical
> partition-key value can map to multiple HDFS directories and Hive partitions.
> Example in Hive:
> {code}
> CREATE TABLE t (i INT) PARTITIONED BY (p INT);
> ALTER TABLE t ADD PARTITION (p=0);
> ALTER TABLE t ADD PARTITION (p=00);
> ALTER TABLE t ADD PARTITION (p=000);
> SHOW PARTITIONS t;
> p=0
> p=00
> p=000
> {code}
> The above statements will result in three different HDFS directories, one for
> each of the "distinct" partitions.
> The same result can be achieved with static partition inserts from Hive,
> instead of ALTER TABLE ADD PARTITION.
> Note that Impala will use a canonical representation for any partition-key value
> based on the underlying LiteralExpr, so a similarly strange metadata state
> cannot be created from Impala, even if given the same input as in the example
> above.
> A special case of this issue was reported in HIVE-6590 and IMPALA-3963, but
> the underlying problem is more general.
> *Issues in Impala*
> Impala has difficulties dealing with such ambiguous partitions due to the
> internal assumption that a single assignment of values to partition keys maps
> to a single Hive partition with one corresponding HDFS directory.
> As long as the cached partition metadata in Impala is correct, queries will
> return correct results even with partition filters. Impala effectively
> coalesces the different partition variants, for example, SELECT * FROM t
> WHERE p=0 will scan all three directories from the example above.
> The following statements are known to have problems in Impala if such ambiguous
> partitions exist:
> * REFRESH <table> and REFRESH <partition>. After such a statement Impala may
> have duplicate and/or missing partitions, leading to incorrect query results.
> * ALTER TABLE RECOVER PARTITIONS, same as REFRESH above.
> * ALTER TABLE <table> DROP PARTITIONS. Impala will only be able to drop the
> one partition with the canonical value representation. Other variants of
> the same partition cannot be dropped.
> * Any other ALTER TABLE ... PARTITION(). Impala will only modify the one
> partition with the canonical value representation (if any).
> * It is safest to assume that all other metadata statements that operate on a
> single partition are likewise not functioning as intended.
> *Workarounds*
> * Ensure that partitions created via Hive do not exhibit ambiguity. Stick to
> a single partition-key value representation, e.g., use p=0 consistently and
> avoid variants like p=000.
> * Avoid those statements in Hive that can create the bad metadata. Always use
> fully dynamic partition inserts and avoid adding partitions via static
> partition inserts or ALTER TABLE.
> * Running INVALIDATE METADATA <table> will bring Impala's metadata back into
> a consistent state (including all partition variants). Queries will return
> correct results, but some DDL operations may still not fully work (like DROP
> PARTITION).
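
The first workaround above depends on knowing whether a table already has ambiguous partitions. The following is a minimal sketch, not part of Impala or Hive, of one way to check: it takes partition names as printed by SHOW PARTITIONS and groups them by their canonical integer value. The function name find_ambiguous_int_partitions is hypothetical, and the sketch assumes a single INT partition column and unescaped partition names.

```python
from collections import defaultdict


def find_ambiguous_int_partitions(partition_names):
    """Group names such as 'p=00' by their canonical INT representation.

    Returns a dict mapping the canonical name (e.g. 'p=0') to the sorted
    list of variants, keeping only values that have more than one variant.
    """
    groups = defaultdict(set)
    for name in partition_names:
        name = name.strip()
        key, _, raw = name.partition("=")
        # Assumption: one INT partition column. int() drops leading zeros,
        # which mimics a canonical literal representation for this type.
        canonical = f"{key}={int(raw)}"
        groups[canonical].add(name)
    return {c: sorted(v) for c, v in groups.items() if len(v) > 1}


if __name__ == "__main__":
    # Partition names from the example above, plus one unambiguous partition.
    names = ["p=0", "p=00", "p=000", "p=1"]
    for canonical, variants in find_ambiguous_int_partitions(names).items():
        print(canonical, "<-", ", ".join(variants))
```

Any group the function reports corresponds to a single logical partition-key value backed by several Hive partitions and HDFS directories. Multi-column partitioning and other column types would need their own canonicalization rules.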