[Impala-ASF-CR] IMPALA-10152: Add support for Iceberg HiveCatalog

Zoltan Borok-Nagy (Code Review) Wed, 18 Nov 2020 02:24:35 -0800

Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16721 )


Change subject: IMPALA-10152: Add support for Iceberg HiveCatalog
......................................................................


Patch Set 2:

(16 comments)

Thanks for the comments!

http://gerrit.cloudera.org:8080/#/c/16721/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16721/1//COMMIT_MSG@11
PS1, Line 11: when the table data is stored in object stores like S3.
> Just curios - is this related to eventual consistency? If yes, then I think
Iceberg requires that the underlying filesystem supports atomic renames. I'm 
not sure if S3Guard solves that.


http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java:

http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java@606
PS1, Line 606:     TIcebergCatalog catalog;
> Can you add a comment about HIVE_CATALOG being the default here?
Done


http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/catalog/iceberg/IcebergHiveCatalog.java
File fe/src/main/java/org/apache/impala/catalog/iceberg/IcebergHiveCatalog.java:

http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/catalog/iceberg/IcebergHiveCatalog.java@65
PS1, Line 65:     return hiveCatalog_.createTable(identifier, schema, spec, 
location,
> remove comment
Done


http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/catalog/iceberg/IcebergHiveCatalog.java@73
PS1, Line 73:     TableIdentifier tableId = 
IcebergUtil.getIcebergTableIdentifier(feTable);
> nit: +2 indent
Done


http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/catalog/iceberg/IcebergHiveCatalog.java@81
PS1, Line 81:     try {
> Can we check for tableLocation==null too?
No, in Iceberg util we pass both tableId and tableLocation to make the code 
simpler.


http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/catalog/iceberg/IcebergHiveCatalog.java@83
PS1, Line 83:     } catch (Exception e) {
> I am not 100% sure, but I think it would be better to catch all exceptions
I wrapped them into TableLoadingException.


http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/catalog/iceberg/IcebergHiveCatalog.java@93
PS1, Line 93:     TableIdentifier tableId = 
IcebergUtil.getIcebergTableIdentifier(feTable);
> nit: +2 indent
Done


http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
File fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java:

http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@1870
PS1, Line 1870: Iceberg'
> Iceberg
Done


http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@1960
PS1, Line 1960:
> now this is needed for Iceberg tables too
Done


http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@1976
PS1, Line 1976:   throw new CatalogException(errorMsg);
              :       }
              :
              :       // Retrieve the HMS table to determine if this is a Kudu 
or Iceberg table.
              :       org.apache.hadoop.hive.metastore.api.Table msTbl = 
existingTbl.getMetaStoreTable();
              :       if (msTbl == null) {
              :         Preconditions.checkState(existingTbl instanceof 
IncompleteTable);
              :         Stopwatch hmsLoadSW = Stopwatch.createStarted();
              :         long hmsLoadTime;
> These codes seems similar, can we extract to a method?
I don't think I can do that without some additional refactorings. If I had 
moved isSynchronizedTable() from KuduTable and IcebergTable to Table, I would 
still need to branch based on 'isKuduTable()/isIcebergTable()' because 
KuduCatalogOpexecutor and IcebergCatalogOpExecutor doesn't have a common base 
class. I don't want to do too much refactorings in the context of this patch, 
so I might just leave it as it is.


http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@1985
PS1, Line 1985: y (MetaStoreClient msClient = catalog_.g
> This case (existingTbl instanceof IncompleteTable && isSynchronizedIcebergT
Synchronized table doesn't mean that the table is stored in HiveCatalog. It 
means that the 'external.table.purge' property is true. But the Iceberg table 
might be stored in HadoopTables or HadoopCatalog.

An Iceberg table is incomplete if we couldn't load it via the Iceberg API, 
therefore we cannot execute Iceberg DROP TABLE.

existingTbl instanceof IncompleteTable && isSynchronizedIcebergTable == true is 
quite of an edge case, but it can happen when the underlying directory is 
deleted outside of Impala. We have this test for this case: 
https://github.com/apache/impala/blob/master/tests/query_test/test_iceberg.py#L45

If the table is synchronized and is stored in HiveCatalog, and gets deleted 
outside of Impala, then I think Impala just needs to refresh its metadata 
cache. Anyway, now I'm also dropping incomplete tables via HMS.


http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@1985
PS1, Line 1985:   try (MetaStoreClient msClient = 
catalog_.getMetaStoreClient()) {
              :           msTbl = msClient.getHiveClie
> One line is ok, unnecessary to two lines.
Done


http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@1991
PS1, Line 1991:           hmsLoadTime = hmsLoadSW.elapsed(TimeUnit.NANOSECONDS);
              :         }
              :         existingTbl.updateHMSLoadTableSchemaTime(hmsLoadTime);
              :       }
              :       boolean isSynchronizedKuduTable = msTbl != null &&
              :               KuduTable.isKuduTable(msTbl) && 
KuduTable.isSynchronizedTable(msTbl);
              :       if (isSynchronizedKuduTable) {
              :         KuduCatalogOpExecutor.dropTable(msTbl, /* if exists */ 
true);
              :       }
              :
              :       boolean isSynchronizedIcebergTable = msTbl != null &&
              :           IcebergTable.isIcebergTable(msTbl) &&
              :           IcebergTable.isSynchronizedTable(msTbl);
              :       if (!(existingTbl instanceof IncompleteTable) && 
isSynchronizedIcebergTable) {
              :         Preconditions.checkState(existingTbl instanceof 
IcebergTable);
              :
> nice to have: IMO it would be better to move this before Kudu Iceberg handl
Done


http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@2011
PS1, Line 2011:       boolean isSynchronizedTable = isSynchronizedKuduTable || 
isSynchronizedIcebergTable;
> I think that something like the old implementation of alterTable would be c
I introduced a new variable to make the code more verbose.


http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@3343
PS1, Line 3343: e tableName = oldT
> I think the old bool is clearer (maybe a name like needsHmsAlterTable would
Introduced a new variable with that name.


http://gerrit.cloudera.org:8080/#/c/16721/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-insert.test
File testdata/workloads/functional-query/queries/QueryTest/iceberg-insert.test:

http://gerrit.cloudera.org:8080/#/c/16721/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-insert.test@268
PS1, Line 268: ====
> add test with custom table location
Done



--
To view, visit http://gerrit.cloudera.org:8080/16721
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie574589a1751aaa9ccbd34a89c6819714d103197
Gerrit-Change-Number: 16721
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: wangsheng <[email protected]>
Gerrit-Comment-Date: Wed, 18 Nov 2020 10:24:15 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10152: Add support for Iceberg HiveCatalog

Reply via email to