[Impala-ASF-CR] IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables

Dimitris Tsirogiannis (Code Review) Mon, 10 Oct 2016 08:52:14 -0700

Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables
......................................................................



Patch Set 3:

(23 comments)

http://gerrit.cloudera.org:8080/#/c/4414/3/common/thrift/CatalogObjects.thrift
File common/thrift/CatalogObjects.thrift:

PS3, Line 344: // Enum listing all possible DISTRIBUTE BY types
             : enum TDistributeType {
             :   HASH,
             :   RANGE,
> I don't see this used
Correct. Removed.


PS3, Line 350: // Parameters needed for hash distribution
             : struct TDistributeByHashParam {
             :   1: required list<string> columns
             :   2: required i32 num_buckets
             : }
             : 
             : struct TRangeLiteral {
             :   1: optional i64 int_literal
             :   2: optional string string_literal
             : }
             : 
             : struct TRangeLiteralList {
             :   1: required list<TRangeLiteral> values
             : }
             : 
             : // A range distribution is identified by a list of columns and a 
series of split rows.
             : struct TDistributeByRangeParam {
             :   1: required list<string> columns
             :   2: optional list<TRangeLiteralList> split_rows;
             : }
             : 
             : // Parameters for the DISTRIBUTE BY clause. The actual 
distribution is identified by
             : // the type parameter.
             : struct TDistributeParam {
             :   // Set if type is set to HASH
             :   1: optional TDistributeByHashParam by_hash_param;
             : 
             :   // Set if type is set to RANGE
             :   2: optional TDistributeByRangeParam by_range_param;
             : }
> I see that these are used in the serialized catalog objects, but given that
We store the distribute by params in the KuduTable class which is then 
serialized and sent to al the impalads. So, I think we do need them in this 
case, unless I misunderstood your comment.


PS3, Line 390: 
             :   // Distribution schemes
             :   4: required list<TDistributeParam> distribute_by
> same for this, can we get rid of it?
See my comment above. We do load and serialize the distribute by params in the 
KuduTable class.


http://gerrit.cloudera.org:8080/#/c/4414/3/common/thrift/JniCatalog.thrift
File common/thrift/JniCatalog.thrift:

PS3, Line 398: CatalogObjects.TDistributeParam
> wrt my comments in CatalogObjects, I guess we'd need them here, but I don't
See my responses to your comments and let me know if it's still not clear.


http://gerrit.cloudera.org:8080/#/c/4414/3/fe/src/main/java/org/apache/impala/analysis/ColumnDefOptions.java
File fe/src/main/java/org/apache/impala/analysis/ColumnDefOptions.java:

PS3, Line 21:  * Placeholder for the column definition options of a CREATE 
TABLE statement.
            :  * Contains the list of column definitions and, optionally, the 
list of column names
            :  * specified using the PRIMARY KEY keyword.
> this feels like a weird abstraction... do we really need this class?
This was used only during parsing to store the "(col type, col type, PRIMARY 
KEY (col, ..))" part of the create table stmt. I modified the parser so that 
this isn't needed anymore. Done


http://gerrit.cloudera.org:8080/#/c/4414/3/fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java:

Line 87:   InsertStmt getInsertStmt() { return insertStmt_; }
> why change visibility?
Done


http://gerrit.cloudera.org:8080/#/c/4414/3/fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java:

PS3, Line 344: getTblPrimaryKeyColumnNames(),  null,
> Why would there be PK col names?
Done


PS3, Line 344:   
> nit extra space
Done


http://gerrit.cloudera.org:8080/#/c/4414/3/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java:

PS3, Line 167:     if (getFileFormat() != THdfsFileFormat.KUDU && 
KuduTable.KUDU_STORAGE_HANDLER.equals(
             :         getTblProperties().get(KuduTable.KEY_STORAGE_HANDLER))) {
             :       throw new 
AnalysisException(KUDU_STORAGE_HANDLER_ERROR_MESSAGE);
             :     }
             : 
             :     // Avro tables can have empty column defs because they can 
infer them from the Avro
             :     // schema. Likewise for external Kudu tables, the schema can 
be read from Kudu.
             :     if (getColumnDefs().isEmpty() && getFileFormat() != 
THdfsFileFormat.AVRO
             :         && getFileFormat() != THdfsFileFormat.KUDU) {
             :       throw new AnalysisException("Table requires at least 1 
column");
             :     }
             : 
             :     if (getFileFormat() == THdfsFileFormat.KUDU) {
             :       analyzeKuduFormat(analyzer);
             :     } else {
             :       
AnalysisUtils.throwIfNotNullOrNotEmpty(getDistributeParams(),
             :           "Only Kudu tables can use DISTRIBUTE BY clause.");
             :       throwIfPrimaryKeysWereSpecified("Only Kudu tables can 
specify a PRIMARY KEY.");
             :     }
             : 
             :     if (getFileFormat() == THdfsFileFormat.AVRO) {
             :       setColumnDefs(analyzeAvroSchema(analyzer));
             :       if (getColumnDefs().isEmpty()) {
             :         throw new AnalysisException(
             :             "An Avro table requires column definitions or an 
Avro schema.");
             :       }
             :       AvroSchemaUtils.setFromSerdeComment(getColumnDefs());
             :     }
             :   }
> what if we were to move everything Kudu related here into analyzeKuduFormat
Done. Let me know if that looks any better now.


PS3, Line 279:         throw new AnalysisException(String.format("Column '%s' 
(type: '%s') cannot " +
             :             "be used as a primary key. Keys must be integer or 
string types.",
             :             pkColDef.getColName(), pkColDef.getType().toSql()));
> I don't see a test case for this error
There are a couple in AnalyzeDDLTest.java (L1738, L1743).


PS3, Line 277:       ColumnDef pkColDef = getPrimaryKeyColumnDefs().get(i);
             :       if (!KuduUtil.isSupportedKeyType(pkColDef.getType())) {
             :         throw new AnalysisException(String.format("Column '%s' 
(type: '%s') cannot " +
             :             "be used as a primary key. Keys must be integer or 
string types.",
             :             pkColDef.getColName(), pkColDef.getType().toSql()));
             :       }
             :       if (!pkColDef.equals(getColumnDefs().get(i))) {
             :         throw new AnalysisException(String.format("Primary key 
columns in Kudu " +
             :             "tables must be declared first and in-order. Key 
column '%s' is " +
             :             "out-of-order.", pkColDef.getColName()));
             :       }
> If we weren't to do these checks (i.e. that they're supported types and tha
Done


PS3, Line 315:    * - Only primary key column can be used in distribution 
definitions
             :    * - Distributions may be either hash or ranged (value 
intervals)
             :    * - Multiple distributions can be used/mixed
             :    * - A column can only be used in one hash distribution
             :    * - A range distribution can only be used once and must be 
the last definition
             :    *   (both enforced by the grammar).
> What Kudu supports for partitioning is a bit tricky when we get into nested
Done


PS3, Line 358: throwIfPrimaryKeysWereSpecified
> I do prefer to change this to return a bool and have callers throw: bool ha
Done


http://gerrit.cloudera.org:8080/#/c/4414/3/fe/src/main/java/org/apache/impala/analysis/DistributeParam.java
File fe/src/main/java/org/apache/impala/analysis/DistributeParam.java:

PS3, Line 156:       if (splitRows_ == null) {
             :         builder.append("...");
> can this happen? wouldn't l115 throw?
This is possible because an object of DistributeParam class is created in two 
cases: 
1. During the analysis of a create table stmt for Kudu. In this case, split 
rows for range partitioning are guaranteed to be set so L115 is ok. 
2. When we load the schema and distribution schemes in the catalog for a Kudu 
table. In this case, Kudu's API doesn't currently return any information about 
the split points so this field could be null. 

With the advent of proper range partitioning syntax this should go away. Kudu 
will have to provide information about the range partitions including split 
points or ranges. Would it be ok to add a TODO and reference the pending JIRA? 
Or let me know if you have another preference.


PS3, Line 191:       if (splitRows_ == null) {
             :         result.setBy_range_param(rangeParam);
             :         return result;
             :       }
> this is possible? what does it mean?
See explanation above.


http://gerrit.cloudera.org:8080/#/c/4414/3/fe/src/main/java/org/apache/impala/analysis/TableDataLayout.java
File fe/src/main/java/org/apache/impala/analysis/TableDataLayout.java:

PS3, Line 24: class TableDataLayout {
            : 
            :   private final List<ColumnDef> partitionColDefs_;
            :   private final List<DistributeParam> distributeParams_;
> Not that we're doing this in this review, but we need to think about how th
Agreed. Left a TODO.


PS3, Line 49:   List<ColumnDef> getPartitionColumnDefs() { return 
partitionColDefs_; }
            :   List<DistributeParam> getDistributeParams() { return 
distributeParams_; }
> any reason these aren't public?
No need to be public. The default access method gives public access to classes 
in the same package and that's all we need here. Let me know if it is confusing.


http://gerrit.cloudera.org:8080/#/c/4414/3/fe/src/main/java/org/apache/impala/analysis/TableDef.java
File fe/src/main/java/org/apache/impala/analysis/TableDef.java:

PS3, Line 1: // Copyright 2016 Cloudera Inc.
           : //
           : // Licensed under the Apache License, Version 2.0 (the "License");
           : // you may not use this file except in compliance with the License.
           : // You may obtain a copy of the License at
           : //
           : // http://www.apache.org/licenses/LICENSE-2.0
           : //
           : // Unless required by applicable law or agreed to in writing, 
software
           : // distributed under the License is distributed on an "AS IS" 
BASIS,
           : // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
implied.
           : // See the License for the specific language governing permissions 
and
           : // limitations under the License.
> this and maybe others need the new license header:
arrrgh. Thanks. Done


PS3, Line 126: return fullyQualifiedTableName_ != null ? 
fullyQualifiedTableName_ : tableName_;
> Do we need to support calling this before analysis? It'd be nice if we just
We do call this function before tableName_ has been analyzed. One example is 
during the parsing of CREATE TABLE LIKE  statements where we access the 
specified table name from a TableDef object.


PS3, Line 137:   List<DistributeParam> getDistributeParams() {
             :     return dataLayout_.getDistributeParams();
             :   }
> 1line?
1 char off :)


http://gerrit.cloudera.org:8080/#/c/4414/3/fe/src/main/java/org/apache/impala/catalog/KuduTable.java
File fe/src/main/java/org/apache/impala/catalog/KuduTable.java:

PS3, Line 258: Creates a tmp Kudu table. The table is not added to the Kudu 
storage engine or the
             :    * HMS.
> I wasn't sure what this meant. How about:
Done


PS3, Line 259: primarily
> remove
Done


http://gerrit.cloudera.org:8080/#/c/4414/3/infra/python/deps/requirements.txt
File infra/python/deps/requirements.txt:

PS3, Line 83: The kudu-python
            : # version in download_requirements must be kept in sync with this 
version.
> gotta update download_requirements as well
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4414
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dtsirogian...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <m...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables

Reply via email to