wangsheng has posted comments on this change. ( http://gerrit.cloudera.org:8080/16446 )
Change subject: IMPALA-10164: Supporting HadoopCatalog for Iceberg table ...................................................................... Patch Set 13: > Hi WangSheng, thank you for your reply. > > I think we should choose the option that causes the least > confusions. > With the current solution the users can create two tables in the > following way: > > CREATE TABLE ice_1 (i int) STORED AS ICEBERG LOCATION > 'hdfs://test-warehouse/ice-catalog'; > CREATE TABLE ice_2 (s string) STORED AS ICEBERG LOCATION > 'hdfs://test-warehouse/ice-catalog'; > > ice_1 and ice_2 will be created in the same hadoop catalog, but > they can contain their own data under their own (implicit) table > location. > I think it's a valid use case as iceberg supports creating multiple > tables in the same catalog. Moreover, it's probably how Iceberg > catalogs are meant to be used. > Now if the user DROPs one of them, HMS will remove the whole > catalog, possibly causing an unintended data loss. > > Note that this case is different than having two managed PARQUET > tables based on same location, because in that case the tables > point to the same data. Also I cannot think of a use case when > users should do that. > > I propose the following: we should introduce a new table property: > 'iceberg.catalog_location'. > > So users would create tables with the following statement if they > want to use hadoop catalog: > > CREATE TABLE ice_t (i int) > STORED AS ICEBERG > TBLPROPERTIES('iceberg.catalog'='hadoop_catalog', > 'iceberg.catalog_location'='hdfs://test-warehouse/ice-catalog'); > > In that case it would be quite explicit what's happening. And we'd > set the table's location to what Iceberg computes for the table > (from <Catalog location> + <table identifier>). > We probably even want to prohibit explicitly setting the table > LOCATION (so SHOW CREATE TABLE shouldn't include it either) when > using hadoop catalog. > > So DROP TABLE wouldn't affect other tables, and DESCRIBE FORMATTED > would automatically show the actual table LOCATION and > 'iceberg.catalog_location' (since it's a table property). > If we DROP all the tables in an iceberg catalog, then the empty > catalog directory will still remain, but I don't see that as a > serious issue. > > Tables created via HadoopTables are not affected, i.e. they > continue to work like they already work. > > What do you think about this approach? Hi Zoltan, 'iceberg.catalog_location' is indeed a good approach, and here is some of my understanding for this property. For this table: CREATE TABLE default.ice_t (i int) STORED AS ICEBERG TBLPROPERTIES('iceberg.catalog'='hadoop_catalog', 'iceberg.catalog_location'='hdfs://test-warehouse/ice-catalog'); 1. We prohibit explicitly setting the table LOCATION when using 'hadoop.catalog' 2. When execute 'SHOW CREATE TABLE', table location is 'hdfs://test-warehouse/ice-catalog/default/ice_t' 3. When execute 'DESCRIBE FORMATTED', table location is 'hdfs://test-warehouse/ice-catalog/default/ice_t', and 'iceberg.catalog_location' is 'hdfs://test-warehouse/ice-catalog' 4. When execute 'DROP TABLE', we only delete 'hdfs://test-warehouse/ice-catalog/default/ice_t', and reserve 'hdfs://test-warehouse/ice-catalog' 5. 'iceberg.catalog_location' is necessary for both managed and external 'hadoop.catalog' Iceberg table 6. When loading managed table, we use load('iceberg.catalog_location',TableIdentifier(db.tbl)) 7. When loading external table, we use load('iceberg.catalog_location',TableIdentifier('iceberg.table_name')) Does my understanding correct? And still some questions: 1. When creating external 'hadoop.catalog' Iceberg table, are we load table by 'iceberg.catalog_location' and 'iceberg.table_name'? If so, both these two properties are necessary. 2. Can we set the table LOCATION for 'hadoop.catalog' external table? If not, how do we define table LOCATION? Using 'iceberg.catalog_location'+/db/external_tbl? Or use 'iceberg.catalog_location'+'iceberg.table_name'? -- To view, visit http://gerrit.cloudera.org:8080/16446 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic1893c50a633ca22d4bca6726c9937b026f5d5ef Gerrit-Change-Number: 16446 Gerrit-PatchSet: 13 Gerrit-Owner: wangsheng <sky...@163.com> Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: wangsheng <sky...@163.com> Gerrit-Comment-Date: Fri, 25 Sep 2020 14:25:22 +0000 Gerrit-HasComments: No