Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables

Patch Set 7:


Still a few minor asks to go through (e.g. add JIRA numbers and few 
clarifications) but I am sending the next patch.

File be/src/service/frontend.cc:

Line 62:     "value should be a comma separated list of hostnames or IP 
> are ports optional or mandatory?
Ports are optional.

File common/thrift/CatalogObjects.thrift:

Line 358:   1: required list<TRangeLiteral> values
> Why not a list of TExpr that are expected to be literals? Seems more future
Hm, TExpr? That sounds a bit too much for simple literal values unless we 
envision using exprs, udfs, etc. Thoughts?

Line 368: // the type parameter.
> which type parameter?
Sorry, there was an enum that got removed and I forgot to update the comments. 

File common/thrift/JniCatalog.thrift:

Line 398:   14: optional list<CatalogObjects.TDistributeParam> distribute_by;
> for consistency let's remove trailing ";"

File fe/src/main/cup/sql-parser.cup:

Line 1033: // class doesn't inherit from CreateTableStmt. 
> whitespace

Line 1047: // Used for creating external Kudu tables for which the schema is 
loaded from Kudu.
> There seem to be more uses of this production, so this comment could be mis

Line 1112: // or one RANGE clause
> typo: clauses

File fe/src/main/java/org/apache/impala/analysis/ColumnDef.java:

Line 93:   void setIsPrimaryKey() { isPrimaryKey_ = true; }
> do we need this?
No, removed.

Line 191:       
> can check return value of put()
Good point. Keep forgetting this. Done

File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java:

Line 208:    * Analyzes and checks table properties which are common for both 
managed and external
> typo: common to

Line 255:         "PARTITIONED BY cannot be used with an Kudu table.");
> typo: a Kudu table

Line 273:     AnalysisUtils.throwIfNullOrEmpty(getPrimaryKeyColumnDefs(),
> Shouldn't this check hasPrimaryKey()?

Line 284:             "zero. Given number of replicas is: " + r.toString() + 
> remove trailing .' or add the opening single-quote

Line 318:   private boolean hasPrimaryKey() {
> Isn't it enough to check primaryKeyColDefs_ in tableDef_?
You're right, we can do that since tableDef_ is analyzed.

File fe/src/main/java/org/apache/impala/analysis/DistributeParam.java:

Line 105:     for (String colName: colNames_) {
> we could specify the same distribution column multiple times
Not sure I follow. Are we doing something to prevent this?

Line 127:             throw new AnalysisException("Split values cannot be 
> do we have a test for this?
Hm no, added one. Thanks

Line 223:     colNames_.addAll(colNames);
> do we need toLower()?
We've already lowered the column names before the call to setColumnNames().

File fe/src/main/java/org/apache/impala/analysis/TableDef.java:

Line 69:   // Populated during analysis.
> Authoritative list of primary key column definitions populated during analy

Line 177:     fqTableName_.analyze();
> Do you know if Kudu has more permissive or more restrictive constraints on 
Agreed. Will do.

Line 181:     if (analyzer.dbContainsTable(getTblName().getDb(), getTbl(), 
> Are we going to check Sentry privs for Kudu tables? Also ok to defer this f
It's still being discussed :). Lineage as well.

Line 220:    * Analyzes the primary key columns. Primary keys are only 
supported for Kudu
> Replace the second sentence with a brief description what this checks. It d

Line 234:       StringBuilder columnDefStr = new StringBuilder();
> Not used?

File fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java:

Line 1117:     if (db != null && params.cascade) dropTablesFromKudu(db);
> I think it might be a good idea to do this under the metastoreDdlLock_ as w

File fe/src/main/java/org/apache/impala/service/KuduCatalogOpExecutor.java:

Line 47: /**
> newline before

Line 231:     } catch (Exception e) {
> Did you address my comment more nuanced checking here to distinguish connec
I don't think we can differentiate these two given the way the Kudu client 
works (i.e. it doesn't give you the ability to check if it is successfully 
connected to the master). I added 'e' in the construction of 

File fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java:

Line 1350:     // CTAS in an external Kudu table
> What's the rationale for not allowing this?
Hm, what does it really mean to allow CTAS on external Kudu table? We have the 
Kudu table schema and the inferred schema from the select stmt. Do you think we 
should try to see if they match and then create the table and perform the 
insert? It felt kind of weird to allow something like this. If something fails 
the user will have to get the schema from Kudu and figure out what the insert 
should look like. Instead the user can 1. create an external table, 2. describe 
to see the table schema, 3. do an insert select. Thoughts?

Line 1716:     AnalyzesOk("create table tab (x int, y string, primary key(x, 
y)) " +
> Can you add a brief comment explaining the table layout for this one?
Done. Let me know if that's better.

Line 1717:         "distribute by hash(x) into 3 buckets, range(y) split rows " 
> can we distribute by range, hash?
No, range needs to be the last one. Fails during parsing. Added another parser 

Line 1723:     AnalyzesOk("create table tab (a int, b int, c int, d int, 
primary key (a, b, c))" +
> What happens if I specify PARTITION BY for a Kudu table?
You get an error but the error was saying that no DISTRIBUTE BY is used. Fixed 
it to say that PARTITIONED BY is not allowed for Kudu tables. Also added test. 

Line 1739:     AnalysisError(String.format("create table tab (x int) distribute 
by hash (x) " +
> What about specifying the other params in tblproperties like the distribute
Well these were never specified in TBLPROPERTIES so I am not sure if we should 
be checking for any weird behavior wrt to properties set there. Besides they 
will be ignored and/or an error message will be thrown if some crucial clause 
is missing. Let me know if you have some specific example in mind.

Line 1748:     AnalysisError("create table tab (x int primary key, primary 
key(x)) stored " +
> Add negative test where two ColumnDefs are declared as PK.

Line 1762:     AnalysisError("create table tab (a int, b int, c int, d int, 
primary key(a, b, c)) " +
> Do we have tests somewhere for checking which types are supported with Kudu
We do in kudu_create.test and in test_kudu.py. Added a few more in 
kudu_create.test. Let me know if there is anything else missing.

Line 1772:     // No float split keys
> Even if the column type is float? If so, might want to test that.
float is not allowed as a pk column. We have test for that in kudu_create.test.

Line 1810:     // DISTRIBUTE BY is required for managed tables.
> Primary key is also required, do we have a test?
Ha, interestingly not :). Done

Line 1812:         "Table partitioning must be specified for managed Kudu 
> Let's rephrase this as "Table distribution must be specified ..."

PS6, Line 1828: AnalysisError("create external table t tblproperties (" +
> Not that it makes any sense, but hat happens with:
That results in parsing error. Let me know if you want me to add a test in 

Line 1863:   public void TestCreateAvroTest() {
> Add a test with some of the Kudu-specific clauses and STORED AS AVRO

Line 2822:   public void TestShowFiles() throws AnalysisException {
> DO we have a TODO/JIRA somewhere to go through all DDL/SHOW commands and ma
We have some tests for SHOW CREATE TABLE in test_kudu.py but we need a JIRA for 
show partitions.

File fe/src/test/java/org/apache/impala/analysis/ParserTest.java:

Line 1632
> Why remove this? It will break my setup :)
Sorry, I had issues with this on when running parser tests only. Done

Line 2235:     ParsesOk("CREATE TABLE foo (i INT PRIMARY KEY, PRIMARY KEY(i)) 
> Add case like this:

Line 2541:     ParsesOk("CREATE TABLE Foo TBLPROPERTIES ('a'='b', 'c'='d') AS 
SELECT * from bar");
> nit: let's be consistent with how we chop up string literals across lines (

File fe/src/test/java/org/apache/impala/catalog/HdfsStorageDescriptorTest.java:

Line 58:         "org.apache.impala.hive.serde.ParquetInputFormat",
> revert

File testdata/bin/generate-schema-statements.py:

Line 203:   if row_format and file_format == 'text':
> I think row_format is also valid for sequence files (maybe we're not using 
Hm, I couldn't find anything in the docs or in our tests.

File testdata/datasets/functional/functional_schema_template.sql:

Line 823: distribute by range(id) split rows ((1003), (1007)) stored as kudu;
> ah so much better


Line 218: create table kudu_table_clone like functional_kudu.alltypes
> Why can't we check this in analysis?

File testdata/workloads/functional-query/queries/QueryTest/kudu_create.test:

Line 3: # This test file contains several cases for what basically amount to 
analysis errors,
> Should mention that some of this behavior is intentional and why.

Line 11: ImpalaRuntimeException: Type TIMESTAMP is not supported in Kudu
> Do we get the same message for DECIMAL and complex types? (No need to add a
Yes, added tests. Done

Line 14: create table t primary key (id) distribute by hash (id) into 3 buckets
> Add test for creating and querying a table that has Impala keywords as the 
Yeah you were right about using getIdentSql(). This doesn't work as expected. 
Fixed it and added a test.

File tests/common/kudu_test_suite.py:

PS6, Line 47:     super(KuduTestSuite, cls).setup_class()
            :     if os.environ["KUDU_IS_SUPPORTED"] == "false":
            :       pytest.skip("Kudu is not supported")
> Maybe move L47 to L50 (switch the ordering of these)? This isn't just a nit

PS6, Line 51:   @classmethod
            :   def teardown_class(cls):
            :     pass
> Not needed:

File tests/custom_cluster/test_kudu.py:

PS6, Line 20: from kudu.client import INT32
> ImportError: I think this needs to be kudu.schema

PS6, Line 34:   @classmethod
            :   def add_test_dimensions(cls):
            :     super(CustomClusterTestSuite, cls).add_test_dimensions()
            :     cls.TestMatrix.add_constraint(lambda v:
            :         v.get_value('table_format').file_format == 'parquet' and
            :         v.get_value('table_format').compression_codec == 'none')
> I don't think this is needed: no tests here use a vector.

PS6, Line 43:   def test_kudu_master_hosts(self, cursor, kudu):
> kudu_client, not kudu. Note this isn't just a rename; it's to make use of t

PS6, Line 45:     with self.temp_kudu_table(kudu, [INT32]) as kudu_table:
> kudu_client

PS6, Line 50:             props)
> SyntaxError: missing a closing paren here

File tests/query_test/test_kudu.py:

PS6, Line 27: from pprint import pprint
> Remove: unused import

PS6, Line 30: from copy import copy
            : from tests.beeswax.impala_beeswax import ImpalaBeeswaxException
            : from tests.common.impala_test_suite import ImpalaTestSuite
            : from tests.common.skip import SkipIf
            : from tests.common.test_dimensions import 
            : from tests.common.test_vector import TestDimension
> All imported but not used.

PS6, Line 49:   def test_kudu_scan_node(self, vector):
> This test is racy in exhaustive: two tests will be competing for the same d
Thanks. Done

Line 57:   def test_insert_update_delete(self, vector, unique_database):
> Can we keep the .test file name and the python test function consistent? i.

File tests/query_test/test_kudu.py:

Line 192:        'show_create_sql' can be templates that can be used with 
str.format(). format()
> In our other SHOW CREATE TABLE tests we also try to run execute the output 
See comment below for SPLIT ROWS. Unfortunately, the output is not always 
executable. To be fixed in Kudu.

Line 204:         CREATE TABLE {table} (c INT PRIMARY KEY)
> add comment to the PK column
That's interesting. Actually, currently don't work because we can't persist 
them in Kudu and the first time the schema is loaded from Kudu we override the 
HMS column defs, so the comments are lost. So, option 1 is to store them in 
tblproperies and option 2 to ignore and/or throw error if specified for Kudu 
tables. Thoughts?

Line 213:         TBLPROPERTIES 
> Why include these TBLPROPERTIES?
Good question. If the user doesn't specify the kudu master_addresses we 
populate tblproperties with the known values. The issue is that we can't tell 
if the user added this property or it was generated. I chose to always display 
this property but I can change it if you don't like it.

> is the SPLIT ROWS (...) legal syntax?
No, but Kudu's API doesn't currently return any information about range 
partitions except the column names used.

To view, visit http://gerrit.cloudera.org:8080/4414
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dtsirogian...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <m...@cloudera.com>
Gerrit-Reviewer: Michael Brown <mi...@cloudera.com>
Gerrit-HasComments: Yes

Reply via email to