Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16721 )
Change subject: IMPALA-10152: Add support for Iceberg HiveCatalog ...................................................................... Patch Set 2: (16 comments) Thanks for the comments! http://gerrit.cloudera.org:8080/#/c/16721/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16721/1//COMMIT_MSG@11 PS1, Line 11: when the table data is stored in object stores like S3. > Just curios - is this related to eventual consistency? If yes, then I think Iceberg requires that the underlying filesystem supports atomic renames. I'm not sure if S3Guard solves that. http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java: http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java@606 PS1, Line 606: TIcebergCatalog catalog; > Can you add a comment about HIVE_CATALOG being the default here? Done http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/catalog/iceberg/IcebergHiveCatalog.java File fe/src/main/java/org/apache/impala/catalog/iceberg/IcebergHiveCatalog.java: http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/catalog/iceberg/IcebergHiveCatalog.java@65 PS1, Line 65: return hiveCatalog_.createTable(identifier, schema, spec, location, > remove comment Done http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/catalog/iceberg/IcebergHiveCatalog.java@73 PS1, Line 73: TableIdentifier tableId = IcebergUtil.getIcebergTableIdentifier(feTable); > nit: +2 indent Done http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/catalog/iceberg/IcebergHiveCatalog.java@81 PS1, Line 81: try { > Can we check for tableLocation==null too? No, in Iceberg util we pass both tableId and tableLocation to make the code simpler. http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/catalog/iceberg/IcebergHiveCatalog.java@83 PS1, Line 83: } catch (Exception e) { > I am not 100% sure, but I think it would be better to catch all exceptions I wrapped them into TableLoadingException. http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/catalog/iceberg/IcebergHiveCatalog.java@93 PS1, Line 93: TableIdentifier tableId = IcebergUtil.getIcebergTableIdentifier(feTable); > nit: +2 indent Done http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java File fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java: http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@1870 PS1, Line 1870: Iceberg' > Iceberg Done http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@1960 PS1, Line 1960: > now this is needed for Iceberg tables too Done http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@1976 PS1, Line 1976: throw new CatalogException(errorMsg); : } : : // Retrieve the HMS table to determine if this is a Kudu or Iceberg table. : org.apache.hadoop.hive.metastore.api.Table msTbl = existingTbl.getMetaStoreTable(); : if (msTbl == null) { : Preconditions.checkState(existingTbl instanceof IncompleteTable); : Stopwatch hmsLoadSW = Stopwatch.createStarted(); : long hmsLoadTime; > These codes seems similar, can we extract to a method? I don't think I can do that without some additional refactorings. If I had moved isSynchronizedTable() from KuduTable and IcebergTable to Table, I would still need to branch based on 'isKuduTable()/isIcebergTable()' because KuduCatalogOpexecutor and IcebergCatalogOpExecutor doesn't have a common base class. I don't want to do too much refactorings in the context of this patch, so I might just leave it as it is. http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@1985 PS1, Line 1985: y (MetaStoreClient msClient = catalog_.g > This case (existingTbl instanceof IncompleteTable && isSynchronizedIcebergT Synchronized table doesn't mean that the table is stored in HiveCatalog. It means that the 'external.table.purge' property is true. But the Iceberg table might be stored in HadoopTables or HadoopCatalog. An Iceberg table is incomplete if we couldn't load it via the Iceberg API, therefore we cannot execute Iceberg DROP TABLE. existingTbl instanceof IncompleteTable && isSynchronizedIcebergTable == true is quite of an edge case, but it can happen when the underlying directory is deleted outside of Impala. We have this test for this case: https://github.com/apache/impala/blob/master/tests/query_test/test_iceberg.py#L45 If the table is synchronized and is stored in HiveCatalog, and gets deleted outside of Impala, then I think Impala just needs to refresh its metadata cache. Anyway, now I'm also dropping incomplete tables via HMS. http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@1985 PS1, Line 1985: try (MetaStoreClient msClient = catalog_.getMetaStoreClient()) { : msTbl = msClient.getHiveClie > One line is ok, unnecessary to two lines. Done http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@1991 PS1, Line 1991: hmsLoadTime = hmsLoadSW.elapsed(TimeUnit.NANOSECONDS); : } : existingTbl.updateHMSLoadTableSchemaTime(hmsLoadTime); : } : boolean isSynchronizedKuduTable = msTbl != null && : KuduTable.isKuduTable(msTbl) && KuduTable.isSynchronizedTable(msTbl); : if (isSynchronizedKuduTable) { : KuduCatalogOpExecutor.dropTable(msTbl, /* if exists */ true); : } : : boolean isSynchronizedIcebergTable = msTbl != null && : IcebergTable.isIcebergTable(msTbl) && : IcebergTable.isSynchronizedTable(msTbl); : if (!(existingTbl instanceof IncompleteTable) && isSynchronizedIcebergTable) { : Preconditions.checkState(existingTbl instanceof IcebergTable); : > nice to have: IMO it would be better to move this before Kudu Iceberg handl Done http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@2011 PS1, Line 2011: boolean isSynchronizedTable = isSynchronizedKuduTable || isSynchronizedIcebergTable; > I think that something like the old implementation of alterTable would be c I introduced a new variable to make the code more verbose. http://gerrit.cloudera.org:8080/#/c/16721/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@3343 PS1, Line 3343: e tableName = oldT > I think the old bool is clearer (maybe a name like needsHmsAlterTable would Introduced a new variable with that name. http://gerrit.cloudera.org:8080/#/c/16721/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-insert.test File testdata/workloads/functional-query/queries/QueryTest/iceberg-insert.test: http://gerrit.cloudera.org:8080/#/c/16721/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-insert.test@268 PS1, Line 268: ==== > add test with custom table location Done -- To view, visit http://gerrit.cloudera.org:8080/16721 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie574589a1751aaa9ccbd34a89c6819714d103197 Gerrit-Change-Number: 16721 Gerrit-PatchSet: 2 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Gabor Kaszab <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: wangsheng <[email protected]> Gerrit-Comment-Date: Wed, 18 Nov 2020 10:24:15 +0000 Gerrit-HasComments: Yes
