Dimitris Tsirogiannis has posted comments on this change. Change subject: IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables ......................................................................
Patch Set 4: (54 comments) http://gerrit.cloudera.org:8080/#/c/4414/4/common/thrift/CatalogObjects.thrift File common/thrift/CatalogObjects.thrift: Line 53: enum THdfsFileFormat { > rename This change would touch many places. Would you mind postponing it for a follow up patch? http://gerrit.cloudera.org:8080/#/c/4414/4/fe/src/main/cup/sql-parser.cup File fe/src/main/cup/sql-parser.cup: Line 976: tbl_def_without_col_defs:tbl_def > a 'create table' without col defs? This was added for the EXTERNAL Kudu table use case for which no column definitions are specified since we load the schema from the Kudu table. Added a comment. Line 980: RESULT = new CreateTableStmt(tbl_def); > trailing Done Line 1033: // class doesn't inherit from CreateTableStmt. > should it? To my opinion yes it should. I actually went down the path of refactoring all the CREATE TABLE* statements but it ended up being too complex to add on top of this big patch. Simplifying the CREATE TABLE statements will also allow us to remove some of the weird table option handling we do in TableDef.java. I will leave a TODO for now. Line 1065: primary_keys_val ::= > opt_primary_keys? Done Line 1089: tbl_data_layout ::= > opt_...? Done Line 1139: {: > fix spaces and tabs Done Line 1370: KW_PRIMARY key_ident > what's wrong with KW_KEY? I don't think we can do that. Don't we use "key" for nested types (map)? http://gerrit.cloudera.org:8080/#/c/4414/4/fe/src/main/java/org/apache/impala/analysis/AnalysisUtils.java File fe/src/main/java/org/apache/impala/analysis/AnalysisUtils.java: Line 30: static void throwIfNotNullOrNotEmpty(Collection<?> c, String message) > this is actually 'not null *and* not empty'. you can also phrase that as 'n Good point. Done http://gerrit.cloudera.org:8080/#/c/4414/4/fe/src/main/java/org/apache/impala/analysis/CreateTableLikeStmt.java File fe/src/main/java/org/apache/impala/analysis/CreateTableLikeStmt.java: Line 158: if (fileFormat_ == THdfsFileFormat.KUDU) { > check at top Done http://gerrit.cloudera.org:8080/#/c/4414/4/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java: Line 236: String.format("PRIMARY KEY must be used instead of the table property '%s'.", > not good: if you do that on an external table, you get this error message i I like that idea but it's a bit more complicated for Kudu because different properties are valid depending on whether it's an external or managed table. Moved the check to the function below. Let me know if that works ok. Line 310: distributeParam.setPKColumnDefMap(pkColumnDefsByName); > setPkColumn... Done Line 315: private boolean hasPrimaryKeysSpecified() { > hasPrimaryKeySpecified (there's only one, which can be a composite key) Done http://gerrit.cloudera.org:8080/#/c/4414/4/fe/src/main/java/org/apache/impala/analysis/DistributeParam.java File fe/src/main/java/org/apache/impala/analysis/DistributeParam.java: Line 121: org.apache.impala.catalog.Type colType = colDef.getType(); > does simply Type conflict with something? Yeah, there is a conflict with the enum Type in this class. Line 129: if (colType.isStringType() && !exprType.isStringType() > this is basically looking for 'assignment compatible', and i'm sure we alre Done Line 150: builder.append(numBuckets_).append(" BUCKETS"); > sprinkle some checkstates in here (on numbuckets and splitrows; or maybe a I added checks for numBuckets_. Split rows will go away in a follow up patch with the new range partitioning syntax. I left a TODO to add a validate function then. Line 200: literal.setString_literal(expr.getStringValue()); > checkstate that you're getting something valid Done Line 211: void setPKColumnDefMap(Map<String, ColumnDef> pkColumnDefByName) { > setPkC... Done http://gerrit.cloudera.org:8080/#/c/4414/4/fe/src/main/java/org/apache/impala/analysis/TableDef.java File fe/src/main/java/org/apache/impala/analysis/TableDef.java: Line 74: static class TableDefOptions { > 'Options' is enough Done Line 160: fullyQualifiedTableName_ = analyzer.getFqTableName(getTblName()); > stick with fq abbreviation? Done Line 189: for (ColumnDef colDef: getPartitionColumnDefs()) { > this is a bit hard to follow. partition cols aren't defined separately, the These are the columns specified in a PARTITIONED BY clause (non-kudu) and they should be analyzed, no? Sorry, I am not sure I follow your comment. http://gerrit.cloudera.org:8080/#/c/4414/4/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java File fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java: PS4, Line 97: org.apache > should this have changed? Hm sed is not that smart :) http://gerrit.cloudera.org:8080/#/c/4414/4/fe/src/main/java/org/apache/impala/catalog/KuduTable.java File fe/src/main/java/org/apache/impala/catalog/KuduTable.java: PS4, Line 89: com.cloudera > can you add a ref to IMPALA-4271 ? Done Line 111: // Distribution schemes of this Kudu table. Both rang and hash-based distributions are > range Done Line 140: return msTbl.getParameters().get(KuduTable.KEY_TABLE_NAME); > i don't think this is worth a function call, it just makes the code harder Done PS4, Line 160: know > known Done PS4, Line 159: /** : * The number of nodes is not know ahead of time and will be updated during computeStats : * in the scan node. : */ : public int getNumNodes() { return -1; } > I don't see this used Yeah, removed. PS4, Line 175: numClusteringCols_ = 0; > not really related to this change, but it's kind of confusing to have numCl I like Marcel's suggestion, I changed it to be the number of primary key columns. PS4, Line 175: numClusteringCols_ = 0; > those should be the primary key cols Done PS4, Line 226: List<FieldSchema> cols = msTable_.getSd().getCols(); : cols.clear(); > why do we get cols from getCols() and then clear() it? cols is a reference to msTable cols. We clear them here and reload them from Kudu schema in L232. Let me know if it's still not clear or if I should add a comment. PS4, Line 232: cols.add(new FieldSchema(colName, type.toSql().toLowerCase(), null)); > why do we do this? cols isn't used later See my comment above. I can add a comment if it's still not clear. http://gerrit.cloudera.org:8080/#/c/4414/4/fe/src/main/java/org/apache/impala/catalog/Table.java File fe/src/main/java/org/apache/impala/catalog/Table.java: PS4, Line 460: msTbl.getTableType().equals(TableType.EXTERNAL_TABLE.toString()); > we shuold probably compare case insensitive to be safe Done http://gerrit.cloudera.org:8080/#/c/4414/4/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java File fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java: PS4, Line 1147: occurrs > occurs Done PS4, Line 1482: } catch (Exception e) { : try { : // Error creating the table in HMS, drop the managed table from Kudu. : if (!Table.isExternalTable(newTable)) { : KuduCatalogOpExecutor.dropTable(newTable, false); : } : } catch (Exception logged) { : String kuduTableName = newTable.getParameters().get(KuduTable.KEY_TABLE_NAME); : LOG.error(String.format("Failed to drop Kudu table '%s'", kuduTableName), : logged); : throw new RuntimeException(String.format("Failed to create the table '%s' in " + : " the Metastore and the newly created Kudu table '%s' could not be " + : " dropped. The log contains more information.", newTable.getTableName(), : kuduTableName), e); : } : if (e instanceof AlreadyExistsException && params.if_not_exists) return false; : throw new ImpalaRuntimeException( : String.format(HMS_RPC_ERROR_FORMAT_STR, "createTable"), e); > it looks like none of this really needs to be inside the synchronized block Done http://gerrit.cloudera.org:8080/#/c/4414/4/fe/src/main/java/org/apache/impala/service/Frontend.java File fe/src/main/java/org/apache/impala/service/Frontend.java: PS4, Line 232: if (!req.is_delta) { : catalog = new ImpaladCatalog(defaultKuduMasterAddrs_); : } > 1line Done http://gerrit.cloudera.org:8080/#/c/4414/4/fe/src/main/java/org/apache/impala/service/KuduCatalogOpExecutor.java File fe/src/main/java/org/apache/impala/service/KuduCatalogOpExecutor.java: PS4, Line 135: if (!hasRangePartitioning) { : tableOpts.setRangePartitionColumns(Collections.<String>emptyList()); : } > I don't think this is necessary Unfortunately it is. I spoke to Dan (from Kudu team) about it. If the user doesn't specify a range partitioning, Kudu by default creates one with all the primary key columns. So, the distribute params we get from Kudu (and use in the SHOW stmt) is different from the distribute params that the user specified. I added a comment to clarify this. Let me know if this is ok. PS4, Line 175: erros > errors Done PS4, Line 192: cols.clear(); > can you indicate in the comment that this doesn't just populate msTbl's col Done PS4, Line 206: new KuduClient > I'm not crazy about this wrapper class thing. It's only used in this file. Done PS4, Line 212: is accessible > exists Done PS4, Line 215: validateTblProperties > how about validateKuduTblExists ? Done PS4, Line 224: Error accessing table in Kudu " + : "master '%s' > This could also print the name. Also to avoid confusing with potential futu Done http://gerrit.cloudera.org:8080/#/c/4414/4/fe/src/main/java/org/apache/impala/util/KuduClient.java File fe/src/main/java/org/apache/impala/util/KuduClient.java: > as I've said I'd vote to remove this, it's only used by 1 class and adds ex Done http://gerrit.cloudera.org:8080/#/c/4414/4/fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java File fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java: Line 1353: "functional.alltypestiny", "Columns cannot be specified with an external " + > odd error message. i would expect the 'as select' to be the offending part. Yeah, you're right. Fixed it. Line 1720: AnalyzesOk("create table tab (x int, y int, primary key (X)) " + > i thought kudu is case-sensitive We lowercase the pk columns during the analysis. Isn't that ok? http://gerrit.cloudera.org:8080/#/c/4414/4/testdata/workloads/functional-query/queries/QueryTest/kudu_create.test File testdata/workloads/functional-query/queries/QueryTest/kudu_create.test: Line 6: as select * from functional.alltypestiny > shouldn't this be part of an analyzer test? It used to be that many of these cases were handled during the analysis. Both MJ and Alex suggested we avoid performing checks that are already performed in Kudu (e.g. no boolean primary key columns). Hence, many of these cases are essentially analysis tests that are caught at runtime. Let me know if you prefer to move these back to the analysis. The only issue with this would be keeping these checks consistent with Kudu. Line 30: distribute by hash (x) into 3 buckets stored as kudu > same here, and for the other analysis error test cases in this file See comment above. Line 32: NonRecoverableException: Key column may not have type of BOOL, FLOAT, or DOUBLE > why wouldn't this be an analysis exception? See comment above. Line 46: NonRecoverableException: Got out-of-order key column: name: "y" type: INT32 is_key: true is_nullable: false cfile_block_size: 0 > inscrutable error message This comes from Kudu. I agree it is not user friendly. I'll file the Kudu team to fix this. Line 53: NonRecoverableException: must have at least two hash buckets > error message should point out the offending clause Error message comes from Kudu. That's the drawback of not doing these checks in the analysis. We don't control the error messages :( Line 60: NonRecoverableException: hash bucket schema components must not contain columns in common > same here Same comment as above. I understand this is annoying. The goal is to have the Kudu team fix these error msgs. http://gerrit.cloudera.org:8080/#/c/4414/4/testdata/workloads/functional-query/queries/QueryTest/kudu_crud.test File testdata/workloads/functional-query/queries/QueryTest/kudu_crud.test: Line 1: ==== > might be a good idea to point out at the top that this test contains test c Good idea. Done Line 5: DISTRIBUTE BY RANGE SPLIT ROWS ((10), (30)) STORED AS KUDU > analyzer test? Done http://gerrit.cloudera.org:8080/#/c/4414/4/testdata/workloads/functional-query/queries/QueryTest/kudu_partition_ddl.test File testdata/workloads/functional-query/queries/QueryTest/kudu_partition_ddl.test: Line 6: DISTRIBUTE BY RANGE(name) SPLIT ROWS (('abc')) STORED AS KUDU > analyzer test? Done -- To view, visit http://gerrit.cloudera.org:8080/4414 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Dimitris Tsirogiannis <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Dimitris Tsirogiannis <[email protected]> Gerrit-Reviewer: Marcel Kornacker <[email protected]> Gerrit-Reviewer: Matthew Jacobs <[email protected]> Gerrit-Reviewer: Michael Brown <[email protected]> Gerrit-HasComments: Yes
