[Impala-ASF-CR] IMPALA-10164: Supporting HadoopCatalog for Iceberg table

Zoltan Borok-Nagy (Code Review) Fri, 25 Sep 2020 03:05:08 -0700

Zoltan Borok-Nagy has posted comments on this change.
( http://gerrit.cloudera.org:8080/16446 )


Change subject: IMPALA-10164: Supporting HadoopCatalog for Iceberg table
......................................................................


Patch Set 13:

Hi WangSheng, thank you for your reply.

I think we should choose the option that causes the least confusion.
With the current solution the users can create two tables in the following way:

  CREATE TABLE ice_1 (i int) STORED AS ICEBERG
  LOCATION 'hdfs://test-warehouse/ice-catalog';
  CREATE TABLE ice_2 (s string) STORED AS ICEBERG
  LOCATION 'hdfs://test-warehouse/ice-catalog';

ice_1 and ice_2 will be created in the same Hadoop catalog, but each can
contain its own data under its own (implicit) table location.
I think it's a valid use case, as Iceberg supports creating multiple tables in
the same catalog. Moreover, it's probably how Iceberg catalogs are meant to be
used.
Now if the user DROPs one of them, HMS will remove the whole catalog (the
LOCATION that both tables share), possibly causing unintended data loss.

Note that this case is different from having two managed PARQUET tables based
on the same location, because in that case the tables point to the same data.
Also, I cannot think of a use case where users would want to do that.

I propose the following: we should introduce a new table property: 
'iceberg.catalog_location'.

So users would create tables with the following statement if they want to use 
hadoop catalog:

  CREATE TABLE ice_t (i int)
  STORED AS ICEBERG
  TBLPROPERTIES('iceberg.catalog'='hadoop_catalog',
                'iceberg.catalog_location'='hdfs://test-warehouse/ice-catalog');

In that case it would be quite explicit what's happening. We'd set the table's
location to what Iceberg computes for the table (from <catalog location> +
<table identifier>).
We probably even want to prohibit explicitly setting the table LOCATION when
using a Hadoop catalog (so SHOW CREATE TABLE shouldn't include it either).
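
To make the "<catalog location> + <table identifier>" idea concrete, here is a
minimal sketch of how such a location could be derived. It is only an
illustration under assumptions: the helper name table_location, the database
name "functional", and the exact layout (catalog location, then each part of
the dot-separated identifier as a directory) are hypothetical and are not taken
from the Iceberg or Impala code.

```python
from posixpath import join as path_join


def table_location(catalog_location: str, identifier: str) -> str:
    """Hypothetical helper: derive a table directory from a catalog location
    and a dot-separated table identifier such as 'db.table'."""
    parts = [p for p in identifier.split(".") if p]
    if not parts:
        raise ValueError("empty table identifier")
    # Assumed layout: <catalog location>/<db>/<table>
    return path_join(catalog_location.rstrip("/"), *parts)


catalog = "hdfs://test-warehouse/ice-catalog"
# "functional" is a made-up database name used only for this example.
print(table_location(catalog, "functional.ice_1"))
print(table_location(catalog, "functional.ice_2"))
```

The point of the sketch is that two tables in the same catalog end up with
distinct locations, so neither table's LOCATION is the catalog directory
itself.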

So DROP TABLE wouldn't affect other tables, and DESCRIBE FORMATTED would 
automatically show the actual table LOCATION and 'iceberg.catalog_location' 
(since it's a table property).
If we DROP all the tables in an Iceberg catalog, the empty catalog directory
will still remain, but I don't see that as a serious issue.
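
The intended DROP behaviour can be modelled on a local filesystem. This is a
rough sketch, not how HMS or Impala implement DROP TABLE: the helpers
create_table and drop_table are hypothetical, a temporary local directory
stands in for HDFS, and the "metadata" and "data" subdirectories are only
illustrative.

```python
import shutil
import tempfile
from pathlib import Path


def create_table(catalog_dir: Path, name: str) -> Path:
    """Hypothetical: give each table its own directory under the catalog."""
    table_dir = catalog_dir / name
    (table_dir / "metadata").mkdir(parents=True)
    (table_dir / "data").mkdir()
    return table_dir


def drop_table(table_dir: Path) -> None:
    """Hypothetical: dropping removes only the table's own directory."""
    shutil.rmtree(table_dir)


def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        catalog_dir = Path(tmp) / "ice-catalog"
        ice_1 = create_table(catalog_dir, "ice_1")
        ice_2 = create_table(catalog_dir, "ice_2")

        drop_table(ice_1)
        after_first_drop = {"ice_1": ice_1.exists(), "ice_2": ice_2.exists()}

        drop_table(ice_2)
        return {
            "after_first_drop": after_first_drop,
            "catalog_dir_remains": catalog_dir.exists(),
            "catalog_dir_empty": not any(catalog_dir.iterdir()),
        }


result = demo()
print(result)
```

With per-table locations, dropping ice_1 leaves ice_2 untouched, and dropping
both leaves behind only the empty catalog directory.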

Tables created via HadoopTables are not affected, i.e. they continue to work
as they do today.

What do you think about this approach?


--
To view, visit http://gerrit.cloudera.org:8080/16446
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic1893c50a633ca22d4bca6726c9937b026f5d5ef
Gerrit-Change-Number: 16446
Gerrit-PatchSet: 13
Gerrit-Owner: wangsheng <[email protected]>
Gerrit-Reviewer: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: wangsheng <[email protected]>
Gerrit-Comment-Date: Fri, 25 Sep 2020 10:04:16 +0000
Gerrit-HasComments: No
