wangsheng has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16446 )

Change subject: IMPALA-10164: Supporting HadoopCatalog for Iceberg table
......................................................................


Patch Set 13:

> Hi WangSheng, thank you for your reply.
 >
 > I think we should choose the option that causes the least
 > confusions.
 > With the current solution the users can create two tables in the
 > following way:
 >
 > CREATE TABLE ice_1 (i int) STORED AS ICEBERG LOCATION
 > 'hdfs://test-warehouse/ice-catalog';
 > CREATE TABLE ice_2 (s string) STORED AS ICEBERG LOCATION
 > 'hdfs://test-warehouse/ice-catalog';
 >
 > ice_1 and ice_2 will be created in the same hadoop catalog, but
 > they can contain their own data under their own (implicit) table
 > location.
 > I think it's a valid use case as iceberg supports creating multiple
 > tables in the same catalog. Moreover, it's probably how Iceberg
 > catalogs are meant to be used.
 > Now if the user DROPs one of them, HMS will remove the whole
 > catalog, possibly causing an unintended data loss.
 >
 > Note that this case is different than having two managed PARQUET
 > tables based on same location, because in that case the tables
 > point to the same data. Also I cannot think of a use case when
 > users should do that.
 >
 > I propose the following: we should introduce a new table property:
 > 'iceberg.catalog_location'.
 >
 > So users would create tables with the following statement if they
 > want to use hadoop catalog:
 >
 > CREATE TABLE ice_t (i int)
 > STORED AS ICEBERG
 > TBLPROPERTIES('iceberg.catalog'='hadoop_catalog',
 > 'iceberg.catalog_location'='hdfs://test-warehouse/ice-catalog');
 >
 > In that case it would be quite explicit what's happening. And we'd
 > set the table's location to what Iceberg computes for the table
 > (from <Catalog location> + <table identifier>).
 > We probably even want to prohibit explicitly setting the table
 > LOCATION (so SHOW CREATE TABLE shouldn't include it either) when
 > using hadoop catalog.
 >
 > So DROP TABLE wouldn't affect other tables, and DESCRIBE FORMATTED
 > would automatically show the actual table LOCATION and
 > 'iceberg.catalog_location' (since it's a table property).
 > If we DROP all the tables in an iceberg catalog, then the empty
 > catalog directory will still remain, but I don't see that as a
 > serious issue.
 >
 > Tables created via HadoopTables are not affected, i.e. they
 > continue to work like they already work.
 >
 > What do you think about this approach?

Hi Zoltan, 'iceberg.catalog_location' is indeed a good approach, and here is 
some of my understanding for this property. For this table:
CREATE TABLE default.ice_t (i int)
  STORED AS ICEBERG
  TBLPROPERTIES('iceberg.catalog'='hadoop_catalog',
                'iceberg.catalog_location'='hdfs://test-warehouse/ice-catalog');
1. We prohibit explicitly setting the table LOCATION when using 'hadoop.catalog'
2. When execute 'SHOW CREATE TABLE', table location is 
'hdfs://test-warehouse/ice-catalog/default/ice_t'
3. When execute 'DESCRIBE FORMATTED', table location is 
'hdfs://test-warehouse/ice-catalog/default/ice_t', and 
'iceberg.catalog_location' is 'hdfs://test-warehouse/ice-catalog'
4. When execute 'DROP TABLE', we only delete 
'hdfs://test-warehouse/ice-catalog/default/ice_t', and reserve 
'hdfs://test-warehouse/ice-catalog'
5. 'iceberg.catalog_location' is necessary for both managed and external 
'hadoop.catalog' Iceberg table
6. When loading managed table, we use 
load('iceberg.catalog_location',TableIdentifier(db.tbl))
7. When loading external table, we use 
load('iceberg.catalog_location',TableIdentifier('iceberg.table_name'))

Does my understanding correct?

And still some questions:

1. When creating external 'hadoop.catalog' Iceberg table, are we load table by 
'iceberg.catalog_location' and 'iceberg.table_name'? If so, both these two 
properties are necessary.
2. Can we set the table LOCATION for 'hadoop.catalog' external table? If not, 
how do we define table LOCATION? Using 
'iceberg.catalog_location'+/db/external_tbl? Or use 
'iceberg.catalog_location'+'iceberg.table_name'?


--
To view, visit http://gerrit.cloudera.org:8080/16446
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic1893c50a633ca22d4bca6726c9937b026f5d5ef
Gerrit-Change-Number: 16446
Gerrit-PatchSet: 13
Gerrit-Owner: wangsheng <sky...@163.com>
Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: wangsheng <sky...@163.com>
Gerrit-Comment-Date: Fri, 25 Sep 2020 14:25:22 +0000
Gerrit-HasComments: No

Reply via email to