Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables
......................................................................


Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/4414/3/common/thrift/CatalogObjects.thrift
File common/thrift/CatalogObjects.thrift:

PS3, Line 350: // Parameters needed for hash distribution
             : struct TDistributeByHashParam {
             :   1: required list<string> columns
             :   2: required i32 num_buckets
             : }
             : 
             : struct TRangeLiteral {
             :   1: optional i64 int_literal
             :   2: optional string string_literal
             : }
             : 
             : struct TRangeLiteralList {
             :   1: required list<TRangeLiteral> values
             : }
             : 
             : // A range distribution is identified by a list of columns and a 
series of split rows.
             : struct TDistributeByRangeParam {
             :   1: required list<string> columns
             :   2: optional list<TRangeLiteralList> split_rows;
             : }
             : 
             : // Parameters for the DISTRIBUTE BY clause. The actual 
distribution is identified by
             : // the type parameter.
             : struct TDistributeParam {
             :   // Set if type is set to HASH
             :   1: optional TDistributeByHashParam by_hash_param;
             : 
             :   // Set if type is set to RANGE
             :   2: optional TDistributeByRangeParam by_range_param;
             : }
> We store the distribute by params in the KuduTable class which is then seri
I don't see why this needs to be in the TKuduTable though. I think we just need 
this in the TCreateTableParams. I wouldn't mind except it's more work to reason 
about when it's set/not set in a TKuduTable.


PS3, Line 390: 
             :   // Distribution schemes
             :   4: required list<TDistributeParam> distribute_by
> See my comment above. We do load and serialize the distribute by params in 
I don't see why this needs to be in TKuduTable, only TCreateTableParams.


http://gerrit.cloudera.org:8080/#/c/4414/3/fe/src/main/java/org/apache/impala/analysis/DistributeParam.java
File fe/src/main/java/org/apache/impala/analysis/DistributeParam.java:

PS3, Line 156:       if (splitRows_ == null) {
             :         builder.append("...");
> This is possible because an object of DistributeParam class is created in t
In case #2 why do we even bother creating a DistributeParams? 

The partitioning/distribution info only really makes sense at the time of table 
creation I think, right? This goes back to our conversation (in person) about 
how it'd probably make sense to separate the partitioning part of the command 
from CREATE TABLE, and do partitioning more like how we would for non-Kudu 
tables, i.e. separate commands. Obviously that hasn't been worked through and 
realistically we probably won't get to that in time.

If we don't get to something like that, I think the only reason we'd want #2 
(in terms of loading DistributeParams) is SHOW CREATE TABLE, so this class only 
needs to support toSql() in case #2. Is that right? Can you add a comment in 
the class header or the relevant constructor where we might get case #2 so it's 
easier to reason about?


A TODO / JIRA ref about resolving the '...' would be good. It might be good to 
add a test case with this output if there isn't one already to make sure we fix 
it properly and/or it doesn't change in some other unexpected way.


PS3, Line 191:       if (splitRows_ == null) {
             :         result.setBy_range_param(rangeParam);
             :         return result;
             :       }
> See explanation above.
I guess it's confusing to me because TDistributeParam doesn't seem to be 
necessary in case #2.

In case #1 it gets added to a TCreateTableStmt object. In case #2 it gets added 
to a TKuduTable object, but it seems unnecessary given that no other backend or 
even the catalog care about the distribute params - right? If that's true, I 
think we could avoid storing this in KuduTable, then avoid calling this in 
KuduTable.getTKuduTable(), and then throwing here if splitRows_ is null since I 
don't think this should be called in this case.


http://gerrit.cloudera.org:8080/#/c/4414/3/fe/src/main/java/org/apache/impala/analysis/TableDef.java
File fe/src/main/java/org/apache/impala/analysis/TableDef.java:

PS3, Line 126: return fullyQualifiedTableName_ != null ? 
fullyQualifiedTableName_ : tableName_;
> We do call this function before tableName_ has been analyzed. One example i
ah ok, can you add a comment?


http://gerrit.cloudera.org:8080/#/c/4414/3/fe/src/main/java/org/apache/impala/analysis/ToSqlUtils.java
File fe/src/main/java/org/apache/impala/analysis/ToSqlUtils.java:

PS3, Line 243: List<String> primaryKeysSql, String kuduDistributeByParams,
can you update the comment, especially around which of these should/can be null 
vs. empty.


-- 
To view, visit http://gerrit.cloudera.org:8080/4414
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dtsirogian...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <m...@cloudera.com>
Gerrit-HasComments: Yes

Reply via email to