[ 
https://issues.apache.org/jira/browse/IMPALA-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-3976.
-----------------------------------
    Resolution: Won't Fix

It looks like this is finally being fixed in Hive, so I don't think we 
necessarily want to invest effort in trying to work around it in Impala.

> Handle partition-key values with multiple synonymous string representations 
> created in Hive.
> --------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-3976
>                 URL: https://issues.apache.org/jira/browse/IMPALA-3976
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>    Affects Versions: Impala 2.3.0, Impala 2.5.0, Impala 2.4.0, Impala 2.6.0, 
> Impala 2.7.0
>            Reporter: Alexander Behm
>            Priority: Major
>              Labels: correctness, incompatibility
>
> For several SQL statements that can create new partitions, Hive seems to 
> generate partition-key values and the corresponding HDFS directory based on 
> the user's string input rather than the corresponding literal value of the 
> appropriate column type. This leads to a situation where a single logical 
> partition-key value can map to multiple HDFS directories and Hive partitions.
> Example in Hive:
> {code}
> CREATE TABLE t (i INT) PARTITIONED BY (p INT);
> ALTER TABLE t ADD PARTITION (p=0);
> ALTER TABLE t ADD PARTITION (p=00);
> ALTER TABLE t ADD PARTITION (p=000);
> SHOW PARTITIONS t;
> p=0
> p=00
> p=000
> {code}
> The above statements will result in three different HDFS directories, one for 
> each of the "distinct" partitions.
> The same result can be achieved with static partition inserts from Hive, 
> instead of ALTER TABLE ADD PARTITION.
> Note that Impala will use a canonical representation for any partition-key value 
> based on the underlying LiteralExpr, so a similarly strange metadata state 
> cannot be created from Impala, even if given the same input as in the example 
> above.
> A special case of this issue was reported in HIVE-6590 and IMPALA-3963, but 
> the underlying problem is more general.
> *Issues in Impala*
> Impala has difficulties dealing with such ambiguous partitions due to the 
> internal assumption that a single assignment of values to partition keys maps 
> to a single Hive partition with one corresponding HDFS directory.
> As long as the cached partition metadata in Impala is correct, queries will 
> return correct results even with partition filters. Impala effectively 
> coalesces the different partition variants, for example, SELECT * FROM t 
> WHERE p=0 will scan all three directories from the example above.
> The following statements are known to have problems in Impala if such ambiguous 
> partitions exist:
> * REFRESH <table> and REFRESH <partition>. After such a statement Impala may 
> have duplicate and/or missing partitions, leading to incorrect query results.
> * ALTER TABLE RECOVER PARTITIONS, same as REFRESH above.
> * ALTER TABLE <table> DROP PARTITIONS. Impala will only be able to drop the 
> one partition with the canonical value representation. Other variants of 
> the same partition cannot be dropped.
> * Any other ALTER TABLE ... PARTITION(). Impala will only modify the one 
> partition with the canonical value representation (if any).
> * It is safest to assume that all other metadata statements that operate on a 
> single partition are likewise not functioning as intended.
> *Workarounds*
> * Ensure that partitions created via Hive do not exhibit ambiguity. Stick to 
> a single partition-key value representation, e.g., use p=0 consistently and 
> avoid variants like p=000.
> * Avoid those statements in Hive that can create the bad metadata. Always use 
> fully dynamic partition inserts and avoid adding partitions via static 
> partition inserts or ALTER TABLE.
> * Running INVALIDATE METADATA <table> will bring Impala's metadata back into 
> a consistent state (including all partition variants). Queries will return 
> correct results, but some DDL operations may still not fully work (like DROP 
> PARTITION).
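
To apply the first workaround, it can help to check whether a table already
contains ambiguous partitions. The following is a minimal sketch, not Impala's
or Hive's actual logic. It assumes that every partition column is of type INT,
that partition names have the key=value[/key=value] form printed by SHOW
PARTITIONS in Hive, and that Python's int() is an acceptable stand-in for the
canonical representation Impala derives from a LiteralExpr. The function names
are hypothetical.

```python
from collections import defaultdict


def canonical_int(raw):
    """Canonical string form of an INT partition-key value (assumption:
    Python's int() approximates the canonical form, e.g. '000' -> '0')."""
    return str(int(raw))


def find_ambiguous_partitions(partition_names):
    """Group partition names by their canonical key values and return only
    the groups that contain more than one string variant."""
    groups = defaultdict(list)
    for name in partition_names:
        # A name looks like "p=00" or "year=2016/month=07".
        canonical = tuple(
            (key, canonical_int(value))
            for key, value in (part.split("=", 1) for part in name.split("/"))
        )
        groups[canonical].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


# Partition names as listed by SHOW PARTITIONS in the example above,
# plus one unambiguous partition for contrast.
ambiguous = find_ambiguous_partitions(["p=0", "p=00", "p=000", "p=1"])
print(ambiguous)
```

Any group reported by this check corresponds to one logical partition-key
value that maps to several Hive partitions and HDFS directories. Other column
types and escaped characters in partition names are not handled by this sketch.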


