[ 
https://issues.apache.org/jira/browse/HIVE-19830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512267#comment-16512267
 ] 

Gabor Kaszab commented on HIVE-19830:
-------------------------------------

Thanks for your explanation [~sershe]!
 Since people use this technique in production, I think it would make sense to 
start a conversation about how to support it in some way, even if it is an 
abuse of the available functionality. I would also include dropping these 
special partitions in this conversation. If it ends up as a feature request, 
that is totally fine with me.


 +Showing data when multiple partitions share the same location:+
 Here I see 2 approaches:

1) Show the same data once for each partition pointing to that location. In 
addition, sum() should also take this multiplication into account. 

2) Show a particular line of data only once, no matter how many partitions 
share it. sum() should also follow this approach. An open question is which 
partition we should pick in this case, e.g. for a 'select *'. I have no 
preference here other than making the choice deterministic.
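
To make the difference concrete, here is a toy Python model of the two
approaches, using the table from the issue description (a single line with i=1
stored under j=1, and partitions j=1 and j=2 both pointing to that directory).
This is only a sketch and not Hive code: the dictionaries and the names
rows_per_partition, rows_per_location and sums are made up for illustration,
and picking the smallest partition value in approach 2 is just one possible
deterministic rule.

```python
# Toy model, not Hive code.
# partitions: partition value j -> location; files: location -> stored i values.
partitions = {1: "/test-warehouse/test/j=1", 2: "/test-warehouse/test/j=1"}
files = {"/test-warehouse/test/j=1": [1]}


def rows_per_partition(partitions, files):
    """Approach 1: emit the data once for every partition pointing to it."""
    return [(i, j) for j, loc in sorted(partitions.items()) for i in files.get(loc, [])]


def rows_per_location(partitions, files):
    """Approach 2: emit each location once, labelled with one chosen partition."""
    chosen = {}
    for j, loc in sorted(partitions.items()):
        chosen.setdefault(loc, j)  # hypothetical rule: smallest partition value wins
    return [(i, j) for loc, j in sorted(chosen.items()) for i in files.get(loc, [])]


def sums(rows):
    """Return (sum(i), sum(j)) over the given rows."""
    return sum(i for i, _ in rows), sum(j for _, j in rows)


approach_1 = rows_per_partition(partitions, files)
approach_2 = rows_per_location(partitions, files)
print(approach_1, sums(approach_1))  # [(1, 1), (1, 2)] (2, 3)
print(approach_2, sums(approach_2))  # [(1, 1)] (1, 1)
```

In this model, approach 1 keeps the 'select *' output that Hive returns today
(1 1 and 1 2) and would make sum(i), sum(j) return 2 3 instead of the 1 2
reported as Issue #1. Approach 2 would return a single line, and the value of j
in it depends on the rule used to pick the partition.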

 

+Dropping partitions when multiple partitions share the same location:+

Here I think it's fine to drop the folder for the partition. Compared to the 
current behaviour, my only ask would be that the other partitions pointing to 
the deleted folder no longer appear as valid partitions, e.g. when 'show 
partitions' is invoked or any other system asks for the list of valid partitions.

 

What do you think, does this make sense?

> Inconsistent behavior when multiple partitions point to the same location
> -------------------------------------------------------------------------
>
>                 Key: HIVE-19830
>                 URL: https://issues.apache.org/jira/browse/HIVE-19830
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 2.4.0
>            Reporter: Gabor Kaszab
>            Assignee: Adam Szita
>            Priority: Major
>
> // create a table with 2 partitions where both partitions share the same 
> location and insert a single line into one of them.
> create table test (i int) partitioned by (j int) stored as parquet;
> alter table test add partition (j=1) location 
> 'hdfs://localhost:20500/test-warehouse/test/j=1';
> alter table test add partition (j=2) location 
> 'hdfs://localhost:20500/test-warehouse/test/j=1';
> insert into table test partition (j=1) values (1);
> // select * show this single line in both partitions as expected.
> select * from test;
> 1 1
> 1 2
> // however, sum() doesn't add up the line for all the partitions. This is 
> +Issue #1+.
> select sum(i), sum(j) from test;
> 1 2
> // On the file system there is a common dir for the 2 partitions, which is 
> expected.
> hdfs dfs -ls hdfs://localhost:20500/test-warehouse/test/
> Found 1 items
> drwxr-xr-x - gaborkaszab supergroup 0 2018-06-08 10:54 
> hdfs://localhost:20500/test-warehouse/test/j=1
> // Let's drop one of the partitions now!
> alter table test drop partition (j=2);
> // running the same hdfs dfs -ls command shows that the j=1 directory is 
> dropped. I think this is a good behavior, we just have to document that this 
> is the expected case.
> // select * from test; returns zero rows, this is still as expected.
> // Even though the dir is dropped j=1 partition is still visible with show 
> partitions. This is +Issue #2+.
> show partitions test;
> j=1
> After dropping the directory with Hive, when Impala reloads its partitions 
> it asks Hive which partitions exist. Apparently, Hive sends 
> down a list that includes the j=1 partition, so Impala treats it as an 
> existing one and doesn't drop it from the Catalog's cache. Hive shouldn't 
> send that partition down here. This is +Issue #3+.



