Vihang Karajgaonkar has posted comments on this change. ( http://gerrit.cloudera.org:8080/15797 )
Change subject: IMPALA-9688: Support create iceberg table by impala ...................................................................... Patch Set 22: (11 comments) Thanks for bearing with me. I left some comments below. The patch is good to go from my side if you can address those in this patch (or create follow up JIRAs for addressing them). http://gerrit.cloudera.org:8080/#/c/15797/21//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15797/21//COMMIT_MSG@11 PS21, Line 11: create table > Yes, iceberg supported some dml operation as well. I'm working on IMPALA-97 thanks for the clarification. http://gerrit.cloudera.org:8080/#/c/15797/21//COMMIT_MSG@18 PS21, Line 18: partition by spec( : level identity, : event_time identity, : event_time hour, : register_time day : ) > 1. I refer to the syntax of kudu table DDl: partition by hash/range, and I Thanks. I noticed that Presto has some support for Iceberg as per https://prestodb.io/docs/current/release/release-0.123.html but I am not sure how different or similar it is with this. http://gerrit.cloudera.org:8080/#/c/15797/21/fe/src/main/cup/sql-parser.cup File fe/src/main/cup/sql-parser.cup: http://gerrit.cloudera.org:8080/#/c/15797/21/fe/src/main/cup/sql-parser.cup@1607 PS21, Line 1607: createIcebergPartitionedLayout What happens when an invalid transform is provided in the input? This method throws a TableLoadingException which is a bit weird to see during the Parsing stage. In case of Kudu we don't have that problem since the KW_HASH and KW_RANGE have specific keywords defined. http://gerrit.cloudera.org:8080/#/c/15797/21/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java: http://gerrit.cloudera.org:8080/#/c/15797/21/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java@256 PS21, Line 256: // schema. Likewise for external Kudu tables, the schema can be read from Kudu. : if (getColumnDefs().isEmpty() && getFileFormat() != THdfsFileFormat.AVRO : && getFileFormat() != THdfsFileFormat.KUDU) { : Can Iceberg tables have 0 columns? If not, may be add a check here too for Iceberg. http://gerrit.cloudera.org:8080/#/c/15797/22/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java: http://gerrit.cloudera.org:8080/#/c/15797/22/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java@143 PS22, Line 143: getIcebergPartitionSpecs Is it possible that the user creates a iceberg table without providing a partition by spec clause? What is the behavior in this case for table loading code? I think it would great if you could add some positive and negative tests similar to AnalyzeKuduDDLTest (I think its okay to do it as a follow up in separate JIRA). http://gerrit.cloudera.org:8080/#/c/15797/22/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java@268 PS22, Line 268: } : : if (getFileFormat() == THdfsFileFormat.ICEBERG) { : analyzeIcebergFormat(); : } If you look at how Kudu table analysis is implemented, tableDef_.analyze(analyzer) method in line 253 makes sure that the if there are any Kudu options used, the table is indeed a Kudu table. So may be we should do it here as well for Iceberg specific options. Also, I don't see the IcebergPartitionSpec itself getting analyzed anywhere. For example, if there are any column types which are not supported we should throw analysis exception if user tries to provide such columns. Also, we should make sure that the partition spec column names exist in the table column names if that is disallowed. I think it will be cleaner if we call IcebergPartitionSpec.analyze() method here so that we can plugin any checks in IcebergPartitionField during the createTableStmt analysis phase. So may be add something like for the sake of completeness. for (IcebergPartitionSpec partitionSpec : getIceBergPartitionSpecs) { partitionSpec.analyze(); } http://gerrit.cloudera.org:8080/#/c/15797/22/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java@579 PS22, Line 579: putGeneratedKuduProperty Can you please rename this to putGeneratedProperty since this method is shared by Kudu and Iceberg now? http://gerrit.cloudera.org:8080/#/c/15797/22/fe/src/main/java/org/apache/impala/analysis/IcebergPartitionField.java File fe/src/main/java/org/apache/impala/analysis/IcebergPartitionField.java: http://gerrit.cloudera.org:8080/#/c/15797/22/fe/src/main/java/org/apache/impala/analysis/IcebergPartitionField.java@28 PS22, Line 28: //Column source id from PartitionField : private int sourceId_; : : //Column id from iceberg PartitionField : private int fieldId_; I think it would be great if we add more documentation about these fields and how they are used by Iceberg. Specifically, I see that during the parsing stage, we always create IcebergPartitionField with the sourceId and fieldId as 0,0. Do these ids ever change? http://gerrit.cloudera.org:8080/#/c/15797/22/fe/src/main/java/org/apache/impala/analysis/TableDataLayout.java File fe/src/main/java/org/apache/impala/analysis/TableDataLayout.java: http://gerrit.cloudera.org:8080/#/c/15797/22/fe/src/main/java/org/apache/impala/analysis/TableDataLayout.java@24 PS22, Line 24: PARTITION BY and PARTITIONED BY Can you update this javadoc to include that this is used by Iceberg's partition by spec also? http://gerrit.cloudera.org:8080/#/c/15797/21/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java File fe/src/main/java/org/apache/impala/catalog/IcebergTable.java: http://gerrit.cloudera.org:8080/#/c/15797/21/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java@231 PS21, Line 231: for (PartitionField field : spec.fields()) { : fields.add(new IcebergPartitionField(field.sourceId(), field.fieldId(), : field.name(), IcebergUtil.getPartitionTransform(field))); : } Do we need to make sure that the field id or name exists in the table column spec? For instance if a table is created as create table iceberg_test( level string, event_time timestamp, ) partition by spec( level identity, event_time hour, ) stored as iceberg; Can the specs here include a name which is not in (level, event_time)? http://gerrit.cloudera.org:8080/#/c/15797/21/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java File fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java: http://gerrit.cloudera.org:8080/#/c/15797/21/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@2547 PS21, Line 2547: msClient.getHiveClient().createTable(newTable); > Yes, you are right, iceberg table is always unpartitioned in hms. I don't u What I meant was to add a comment here saying that "Iceberg tables are always unpartitioned. The partition columns are derived from the partition spec is derived from The TCreateTableParams field and could include one or more of the table columns". Also, may be make sure to add Preconditions.checkState(newTable.getPartitionKeys() == null || newTable.getPartitionKeys().isEmpty()) as a sanity check. -- To view, visit http://gerrit.cloudera.org:8080/15797 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I8d85db4c904a8c758c4cfb4f19cfbdab7e6ea284 Gerrit-Change-Number: 15797 Gerrit-PatchSet: 22 Gerrit-Owner: wangsheng <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Sahil Takiar <[email protected]> Gerrit-Reviewer: Vihang Karajgaonkar <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: wangsheng <[email protected]> Gerrit-Comment-Date: Mon, 15 Jun 2020 19:40:02 +0000 Gerrit-HasComments: Yes
