wangsheng has posted comments on this change. (
http://gerrit.cloudera.org:8080/16446 )
Change subject: IMPALA-10164: Supporting HadoopCatalog for Iceberg table
......................................................................
Patch Set 13:
> Hi WangSheng, thank you for your reply.
>
> I think we should choose the option that causes the least
> confusions.
> With the current solution the users can create two tables in the
> following way:
>
> CREATE TABLE ice_1 (i int) STORED AS ICEBERG LOCATION
> 'hdfs://test-warehouse/ice-catalog';
> CREATE TABLE ice_2 (s string) STORED AS ICEBERG LOCATION
> 'hdfs://test-warehouse/ice-catalog';
>
> ice_1 and ice_2 will be created in the same hadoop catalog, but
> they can contain their own data under their own (implicit) table
> location.
> I think it's a valid use case as iceberg supports creating multiple
> tables in the same catalog. Moreover, it's probably how Iceberg
> catalogs are meant to be used.
> Now if the user DROPs one of them, HMS will remove the whole
> catalog, possibly causing an unintended data loss.
>
> Note that this case is different than having two managed PARQUET
> tables based on same location, because in that case the tables
> point to the same data. Also I cannot think of a use case when
> users should do that.
>
> I propose the following: we should introduce a new table property:
> 'iceberg.catalog_location'.
>
> So users would create tables with the following statement if they
> want to use hadoop catalog:
>
> CREATE TABLE ice_t (i int)
> STORED AS ICEBERG
> TBLPROPERTIES('iceberg.catalog'='hadoop_catalog',
> 'iceberg.catalog_location'='hdfs://test-warehouse/ice-catalog');
>
> In that case it would be quite explicit what's happening. And we'd
> set the table's location to what Iceberg computes for the table
> (from <Catalog location> + <table identifier>).
> We probably even want to prohibit explicitly setting the table
> LOCATION (so SHOW CREATE TABLE shouldn't include it either) when
> using hadoop catalog.
>
> So DROP TABLE wouldn't affect other tables, and DESCRIBE FORMATTED
> would automatically show the actual table LOCATION and
> 'iceberg.catalog_location' (since it's a table property).
> If we DROP all the tables in an iceberg catalog, then the empty
> catalog directory will still remain, but I don't see that as a
> serious issue.
>
> Tables created via HadoopTables are not affected, i.e. they
> continue to work like they already work.
>
> What do you think about this approach?
Hi Zoltan, 'iceberg.catalog_location' is indeed a good approach, and here is
some of my understanding for this property. For this table:
CREATE TABLE default.ice_t (i int)
STORED AS ICEBERG
TBLPROPERTIES('iceberg.catalog'='hadoop_catalog',
'iceberg.catalog_location'='hdfs://test-warehouse/ice-catalog');
1. We prohibit explicitly setting the table LOCATION when using 'hadoop.catalog'
2. When execute 'SHOW CREATE TABLE', table location is
'hdfs://test-warehouse/ice-catalog/default/ice_t'
3. When execute 'DESCRIBE FORMATTED', table location is
'hdfs://test-warehouse/ice-catalog/default/ice_t', and
'iceberg.catalog_location' is 'hdfs://test-warehouse/ice-catalog'
4. When execute 'DROP TABLE', we only delete
'hdfs://test-warehouse/ice-catalog/default/ice_t', and reserve
'hdfs://test-warehouse/ice-catalog'
5. 'iceberg.catalog_location' is necessary for both managed and external
'hadoop.catalog' Iceberg table
6. When loading managed table, we use
load('iceberg.catalog_location',TableIdentifier(db.tbl))
7. When loading external table, we use
load('iceberg.catalog_location',TableIdentifier('iceberg.table_name'))
Does my understanding correct?
And still some questions:
1. When creating external 'hadoop.catalog' Iceberg table, are we load table by
'iceberg.catalog_location' and 'iceberg.table_name'? If so, both these two
properties are necessary.
2. Can we set the table LOCATION for 'hadoop.catalog' external table? If not,
how do we define table LOCATION? Using
'iceberg.catalog_location'+/db/external_tbl? Or use
'iceberg.catalog_location'+'iceberg.table_name'?
--
To view, visit http://gerrit.cloudera.org:8080/16446
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic1893c50a633ca22d4bca6726c9937b026f5d5ef
Gerrit-Change-Number: 16446
Gerrit-PatchSet: 13
Gerrit-Owner: wangsheng <[email protected]>
Gerrit-Reviewer: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: wangsheng <[email protected]>
Gerrit-Comment-Date: Fri, 25 Sep 2020 14:25:22 +0000
Gerrit-HasComments: No