Dimitris Tsirogiannis has posted comments on this change. Change subject: IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables ......................................................................
Patch Set 3: (23 comments) http://gerrit.cloudera.org:8080/#/c/4414/3/common/thrift/CatalogObjects.thrift File common/thrift/CatalogObjects.thrift: PS3, Line 344: // Enum listing all possible DISTRIBUTE BY types : enum TDistributeType { : HASH, : RANGE, > I don't see this used Correct. Removed. PS3, Line 350: // Parameters needed for hash distribution : struct TDistributeByHashParam { : 1: required list<string> columns : 2: required i32 num_buckets : } : : struct TRangeLiteral { : 1: optional i64 int_literal : 2: optional string string_literal : } : : struct TRangeLiteralList { : 1: required list<TRangeLiteral> values : } : : // A range distribution is identified by a list of columns and a series of split rows. : struct TDistributeByRangeParam { : 1: required list<string> columns : 2: optional list<TRangeLiteralList> split_rows; : } : : // Parameters for the DISTRIBUTE BY clause. The actual distribution is identified by : // the type parameter. : struct TDistributeParam { : // Set if type is set to HASH : 1: optional TDistributeByHashParam by_hash_param; : : // Set if type is set to RANGE : 2: optional TDistributeByRangeParam by_range_param; : } > I see that these are used in the serialized catalog objects, but given that We store the distribute by params in the KuduTable class which is then serialized and sent to al the impalads. So, I think we do need them in this case, unless I misunderstood your comment. PS3, Line 390: : // Distribution schemes : 4: required list<TDistributeParam> distribute_by > same for this, can we get rid of it? See my comment above. We do load and serialize the distribute by params in the KuduTable class. http://gerrit.cloudera.org:8080/#/c/4414/3/common/thrift/JniCatalog.thrift File common/thrift/JniCatalog.thrift: PS3, Line 398: CatalogObjects.TDistributeParam > wrt my comments in CatalogObjects, I guess we'd need them here, but I don't See my responses to your comments and let me know if it's still not clear. http://gerrit.cloudera.org:8080/#/c/4414/3/fe/src/main/java/org/apache/impala/analysis/ColumnDefOptions.java File fe/src/main/java/org/apache/impala/analysis/ColumnDefOptions.java: PS3, Line 21: * Placeholder for the column definition options of a CREATE TABLE statement. : * Contains the list of column definitions and, optionally, the list of column names : * specified using the PRIMARY KEY keyword. > this feels like a weird abstraction... do we really need this class? This was used only during parsing to store the "(col type, col type, PRIMARY KEY (col, ..))" part of the create table stmt. I modified the parser so that this isn't needed anymore. Done http://gerrit.cloudera.org:8080/#/c/4414/3/fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java File fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java: Line 87: InsertStmt getInsertStmt() { return insertStmt_; } > why change visibility? Done http://gerrit.cloudera.org:8080/#/c/4414/3/fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java File fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java: PS3, Line 344: getTblPrimaryKeyColumnNames(), null, > Why would there be PK col names? Done PS3, Line 344: > nit extra space Done http://gerrit.cloudera.org:8080/#/c/4414/3/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java: PS3, Line 167: if (getFileFormat() != THdfsFileFormat.KUDU && KuduTable.KUDU_STORAGE_HANDLER.equals( : getTblProperties().get(KuduTable.KEY_STORAGE_HANDLER))) { : throw new AnalysisException(KUDU_STORAGE_HANDLER_ERROR_MESSAGE); : } : : // Avro tables can have empty column defs because they can infer them from the Avro : // schema. Likewise for external Kudu tables, the schema can be read from Kudu. : if (getColumnDefs().isEmpty() && getFileFormat() != THdfsFileFormat.AVRO : && getFileFormat() != THdfsFileFormat.KUDU) { : throw new AnalysisException("Table requires at least 1 column"); : } : : if (getFileFormat() == THdfsFileFormat.KUDU) { : analyzeKuduFormat(analyzer); : } else { : AnalysisUtils.throwIfNotNullOrNotEmpty(getDistributeParams(), : "Only Kudu tables can use DISTRIBUTE BY clause."); : throwIfPrimaryKeysWereSpecified("Only Kudu tables can specify a PRIMARY KEY."); : } : : if (getFileFormat() == THdfsFileFormat.AVRO) { : setColumnDefs(analyzeAvroSchema(analyzer)); : if (getColumnDefs().isEmpty()) { : throw new AnalysisException( : "An Avro table requires column definitions or an Avro schema."); : } : AvroSchemaUtils.setFromSerdeComment(getColumnDefs()); : } : } > what if we were to move everything Kudu related here into analyzeKuduFormat Done. Let me know if that looks any better now. PS3, Line 279: throw new AnalysisException(String.format("Column '%s' (type: '%s') cannot " + : "be used as a primary key. Keys must be integer or string types.", : pkColDef.getColName(), pkColDef.getType().toSql())); > I don't see a test case for this error There are a couple in AnalyzeDDLTest.java (L1738, L1743). PS3, Line 277: ColumnDef pkColDef = getPrimaryKeyColumnDefs().get(i); : if (!KuduUtil.isSupportedKeyType(pkColDef.getType())) { : throw new AnalysisException(String.format("Column '%s' (type: '%s') cannot " + : "be used as a primary key. Keys must be integer or string types.", : pkColDef.getColName(), pkColDef.getType().toSql())); : } : if (!pkColDef.equals(getColumnDefs().get(i))) { : throw new AnalysisException(String.format("Primary key columns in Kudu " + : "tables must be declared first and in-order. Key column '%s' is " + : "out-of-order.", pkColDef.getColName())); : } > If we weren't to do these checks (i.e. that they're supported types and tha Done PS3, Line 315: * - Only primary key column can be used in distribution definitions : * - Distributions may be either hash or ranged (value intervals) : * - Multiple distributions can be used/mixed : * - A column can only be used in one hash distribution : * - A range distribution can only be used once and must be the last definition : * (both enforced by the grammar). > What Kudu supports for partitioning is a bit tricky when we get into nested Done PS3, Line 358: throwIfPrimaryKeysWereSpecified > I do prefer to change this to return a bool and have callers throw: bool ha Done http://gerrit.cloudera.org:8080/#/c/4414/3/fe/src/main/java/org/apache/impala/analysis/DistributeParam.java File fe/src/main/java/org/apache/impala/analysis/DistributeParam.java: PS3, Line 156: if (splitRows_ == null) { : builder.append("..."); > can this happen? wouldn't l115 throw? This is possible because an object of DistributeParam class is created in two cases: 1. During the analysis of a create table stmt for Kudu. In this case, split rows for range partitioning are guaranteed to be set so L115 is ok. 2. When we load the schema and distribution schemes in the catalog for a Kudu table. In this case, Kudu's API doesn't currently return any information about the split points so this field could be null. With the advent of proper range partitioning syntax this should go away. Kudu will have to provide information about the range partitions including split points or ranges. Would it be ok to add a TODO and reference the pending JIRA? Or let me know if you have another preference. PS3, Line 191: if (splitRows_ == null) { : result.setBy_range_param(rangeParam); : return result; : } > this is possible? what does it mean? See explanation above. http://gerrit.cloudera.org:8080/#/c/4414/3/fe/src/main/java/org/apache/impala/analysis/TableDataLayout.java File fe/src/main/java/org/apache/impala/analysis/TableDataLayout.java: PS3, Line 24: class TableDataLayout { : : private final List<ColumnDef> partitionColDefs_; : private final List<DistributeParam> distributeParams_; > Not that we're doing this in this review, but we need to think about how th Agreed. Left a TODO. PS3, Line 49: List<ColumnDef> getPartitionColumnDefs() { return partitionColDefs_; } : List<DistributeParam> getDistributeParams() { return distributeParams_; } > any reason these aren't public? No need to be public. The default access method gives public access to classes in the same package and that's all we need here. Let me know if it is confusing. http://gerrit.cloudera.org:8080/#/c/4414/3/fe/src/main/java/org/apache/impala/analysis/TableDef.java File fe/src/main/java/org/apache/impala/analysis/TableDef.java: PS3, Line 1: // Copyright 2016 Cloudera Inc. : // : // Licensed under the Apache License, Version 2.0 (the "License"); : // you may not use this file except in compliance with the License. : // You may obtain a copy of the License at : // : // http://www.apache.org/licenses/LICENSE-2.0 : // : // Unless required by applicable law or agreed to in writing, software : // distributed under the License is distributed on an "AS IS" BASIS, : // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. : // See the License for the specific language governing permissions and : // limitations under the License. > this and maybe others need the new license header: arrrgh. Thanks. Done PS3, Line 126: return fullyQualifiedTableName_ != null ? fullyQualifiedTableName_ : tableName_; > Do we need to support calling this before analysis? It'd be nice if we just We do call this function before tableName_ has been analyzed. One example is during the parsing of CREATE TABLE LIKE statements where we access the specified table name from a TableDef object. PS3, Line 137: List<DistributeParam> getDistributeParams() { : return dataLayout_.getDistributeParams(); : } > 1line? 1 char off :) http://gerrit.cloudera.org:8080/#/c/4414/3/fe/src/main/java/org/apache/impala/catalog/KuduTable.java File fe/src/main/java/org/apache/impala/catalog/KuduTable.java: PS3, Line 258: Creates a tmp Kudu table. The table is not added to the Kudu storage engine or the : * HMS. > I wasn't sure what this meant. How about: Done PS3, Line 259: primarily > remove Done http://gerrit.cloudera.org:8080/#/c/4414/3/infra/python/deps/requirements.txt File infra/python/deps/requirements.txt: PS3, Line 83: The kudu-python : # version in download_requirements must be kept in sync with this version. > gotta update download_requirements as well Done -- To view, visit http://gerrit.cloudera.org:8080/4414 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Dimitris Tsirogiannis <dtsirogian...@cloudera.com> Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com> Gerrit-Reviewer: Matthew Jacobs <m...@cloudera.com> Gerrit-HasComments: Yes