Dimitris Tsirogiannis has posted comments on this change. Change subject: IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables ......................................................................
Patch Set 7: (63 comments) Still a few minor asks to go through (e.g. add JIRA numbers and few clarifications) but I am sending the next patch. http://gerrit.cloudera.org:8080/#/c/4414/7/be/src/service/frontend.cc File be/src/service/frontend.cc: Line 62: "value should be a comma separated list of hostnames or IP addresses."); > are ports optional or mandatory? Ports are optional. http://gerrit.cloudera.org:8080/#/c/4414/7/common/thrift/CatalogObjects.thrift File common/thrift/CatalogObjects.thrift: Line 358: 1: required list<TRangeLiteral> values > Why not a list of TExpr that are expected to be literals? Seems more future Hm, TExpr? That sounds a bit too much for simple literal values unless we envision using exprs, udfs, etc. Thoughts? Line 368: // the type parameter. > which type parameter? Sorry, there was an enum that got removed and I forgot to update the comments. Done http://gerrit.cloudera.org:8080/#/c/4414/7/common/thrift/JniCatalog.thrift File common/thrift/JniCatalog.thrift: Line 398: 14: optional list<CatalogObjects.TDistributeParam> distribute_by; > for consistency let's remove trailing ";" Done http://gerrit.cloudera.org:8080/#/c/4414/7/fe/src/main/cup/sql-parser.cup File fe/src/main/cup/sql-parser.cup: Line 1033: // class doesn't inherit from CreateTableStmt. > whitespace Done Line 1047: // Used for creating external Kudu tables for which the schema is loaded from Kudu. > There seem to be more uses of this production, so this comment could be mis Done Line 1112: // or one RANGE clause > typo: clauses Done http://gerrit.cloudera.org:8080/#/c/4414/7/fe/src/main/java/org/apache/impala/analysis/ColumnDef.java File fe/src/main/java/org/apache/impala/analysis/ColumnDef.java: Line 93: void setIsPrimaryKey() { isPrimaryKey_ = true; } > do we need this? No, removed. Line 191: Preconditions.checkState(!colDefsByColName.containsKey(colDef.getColName())); > can check return value of put() Good point. Keep forgetting this. Done http://gerrit.cloudera.org:8080/#/c/4414/7/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java: Line 208: * Analyzes and checks table properties which are common for both managed and external > typo: common to Done Line 255: "PARTITIONED BY cannot be used with an Kudu table."); > typo: a Kudu table Done Line 273: AnalysisUtils.throwIfNullOrEmpty(getPrimaryKeyColumnDefs(), > Shouldn't this check hasPrimaryKey()? Done Line 284: "zero. Given number of replicas is: " + r.toString() + ".'"); > remove trailing .' or add the opening single-quote Done Line 318: private boolean hasPrimaryKey() { > Isn't it enough to check primaryKeyColDefs_ in tableDef_? You're right, we can do that since tableDef_ is analyzed. http://gerrit.cloudera.org:8080/#/c/4414/7/fe/src/main/java/org/apache/impala/analysis/DistributeParam.java File fe/src/main/java/org/apache/impala/analysis/DistributeParam.java: Line 105: for (String colName: colNames_) { > we could specify the same distribution column multiple times Not sure I follow. Are we doing something to prevent this? Line 127: throw new AnalysisException("Split values cannot be NULL"); > do we have a test for this? Hm no, added one. Thanks Line 223: colNames_.addAll(colNames); > do we need toLower()? We've already lowered the column names before the call to setColumnNames(). http://gerrit.cloudera.org:8080/#/c/4414/7/fe/src/main/java/org/apache/impala/analysis/TableDef.java File fe/src/main/java/org/apache/impala/analysis/TableDef.java: Line 69: // Populated during analysis. > Authoritative list of primary key column definitions populated during analy Done Line 177: fqTableName_.analyze(); > Do you know if Kudu has more permissive or more restrictive constraints on Agreed. Will do. Line 181: if (analyzer.dbContainsTable(getTblName().getDb(), getTbl(), Privilege.CREATE) > Are we going to check Sentry privs for Kudu tables? Also ok to defer this f It's still being discussed :). Lineage as well. Line 220: * Analyzes the primary key columns. Primary keys are only supported for Kudu > Replace the second sentence with a brief description what this checks. It d Done Line 234: StringBuilder columnDefStr = new StringBuilder(); > Not used? Done http://gerrit.cloudera.org:8080/#/c/4414/7/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java File fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java: Line 1117: if (db != null && params.cascade) dropTablesFromKudu(db); > I think it might be a good idea to do this under the metastoreDdlLock_ as w Done http://gerrit.cloudera.org:8080/#/c/4414/7/fe/src/main/java/org/apache/impala/service/KuduCatalogOpExecutor.java File fe/src/main/java/org/apache/impala/service/KuduCatalogOpExecutor.java: Line 47: /** > newline before Done Line 231: } catch (Exception e) { > Did you address my comment more nuanced checking here to distinguish connec I don't think we can differentiate these two given the way the Kudu client works (i.e. it doesn't give you the ability to check if it is successfully connected to the master). I added 'e' in the construction of ImpalaRuntimeException. http://gerrit.cloudera.org:8080/#/c/4414/6/fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java File fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java: Line 1350: // CTAS in an external Kudu table > What's the rationale for not allowing this? Hm, what does it really mean to allow CTAS on external Kudu table? We have the Kudu table schema and the inferred schema from the select stmt. Do you think we should try to see if they match and then create the table and perform the insert? It felt kind of weird to allow something like this. If something fails the user will have to get the schema from Kudu and figure out what the insert should look like. Instead the user can 1. create an external table, 2. describe to see the table schema, 3. do an insert select. Thoughts? Line 1716: AnalyzesOk("create table tab (x int, y string, primary key(x, y)) " + > Can you add a brief comment explaining the table layout for this one? Done. Let me know if that's better. Line 1717: "distribute by hash(x) into 3 buckets, range(y) split rows " + > can we distribute by range, hash? No, range needs to be the last one. Fails during parsing. Added another parser test. Line 1723: AnalyzesOk("create table tab (a int, b int, c int, d int, primary key (a, b, c))" + > What happens if I specify PARTITION BY for a Kudu table? You get an error but the error was saying that no DISTRIBUTE BY is used. Fixed it to say that PARTITIONED BY is not allowed for Kudu tables. Also added test. Done Line 1739: AnalysisError(String.format("create table tab (x int) distribute by hash (x) " + > What about specifying the other params in tblproperties like the distribute Well these were never specified in TBLPROPERTIES so I am not sure if we should be checking for any weird behavior wrt to properties set there. Besides they will be ignored and/or an error message will be thrown if some crucial clause is missing. Let me know if you have some specific example in mind. Line 1748: AnalysisError("create table tab (x int primary key, primary key(x)) stored " + > Add negative test where two ColumnDefs are declared as PK. Done Line 1762: AnalysisError("create table tab (a int, b int, c int, d int, primary key(a, b, c)) " + > Do we have tests somewhere for checking which types are supported with Kudu We do in kudu_create.test and in test_kudu.py. Added a few more in kudu_create.test. Let me know if there is anything else missing. Line 1772: // No float split keys > Even if the column type is float? If so, might want to test that. float is not allowed as a pk column. We have test for that in kudu_create.test. Line 1810: // DISTRIBUTE BY is required for managed tables. > Primary key is also required, do we have a test? Ha, interestingly not :). Done Line 1812: "Table partitioning must be specified for managed Kudu tables."); > Let's rephrase this as "Table distribution must be specified ..." Done PS6, Line 1828: AnalysisError("create external table t tblproperties (" + > Not that it makes any sense, but hat happens with: That results in parsing error. Let me know if you want me to add a test in ParserTest. Line 1863: public void TestCreateAvroTest() { > Add a test with some of the Kudu-specific clauses and STORED AS AVRO Done Line 2822: public void TestShowFiles() throws AnalysisException { > DO we have a TODO/JIRA somewhere to go through all DDL/SHOW commands and ma We have some tests for SHOW CREATE TABLE in test_kudu.py but we need a JIRA for show partitions. http://gerrit.cloudera.org:8080/#/c/4414/6/fe/src/test/java/org/apache/impala/analysis/ParserTest.java File fe/src/test/java/org/apache/impala/analysis/ParserTest.java: Line 1632 > Why remove this? It will break my setup :) Sorry, I had issues with this on when running parser tests only. Done Line 2235: ParsesOk("CREATE TABLE foo (i INT PRIMARY KEY, PRIMARY KEY(i)) STORED AS KUDU"); > Add case like this: Done Line 2541: ParsesOk("CREATE TABLE Foo TBLPROPERTIES ('a'='b', 'c'='d') AS SELECT * from bar"); > nit: let's be consistent with how we chop up string literals across lines ( Done http://gerrit.cloudera.org:8080/#/c/4414/6/fe/src/test/java/org/apache/impala/catalog/HdfsStorageDescriptorTest.java File fe/src/test/java/org/apache/impala/catalog/HdfsStorageDescriptorTest.java: Line 58: "org.apache.impala.hive.serde.ParquetInputFormat", > revert Done http://gerrit.cloudera.org:8080/#/c/4414/6/testdata/bin/generate-schema-statements.py File testdata/bin/generate-schema-statements.py: Line 203: if row_format and file_format == 'text': > I think row_format is also valid for sequence files (maybe we're not using Hm, I couldn't find anything in the docs or in our tests. http://gerrit.cloudera.org:8080/#/c/4414/6/testdata/datasets/functional/functional_schema_template.sql File testdata/datasets/functional/functional_schema_template.sql: Line 823: distribute by range(id) split rows ((1003), (1007)) stored as kudu; > ah so much better Done http://gerrit.cloudera.org:8080/#/c/4414/6/testdata/workloads/functional-query/queries/QueryTest/create-table-like-table.test File testdata/workloads/functional-query/queries/QueryTest/create-table-like-table.test: Line 218: create table kudu_table_clone like functional_kudu.alltypes > Why can't we check this in analysis? Done http://gerrit.cloudera.org:8080/#/c/4414/6/testdata/workloads/functional-query/queries/QueryTest/kudu_create.test File testdata/workloads/functional-query/queries/QueryTest/kudu_create.test: Line 3: # This test file contains several cases for what basically amount to analysis errors, > Should mention that some of this behavior is intentional and why. Done Line 11: ImpalaRuntimeException: Type TIMESTAMP is not supported in Kudu > Do we get the same message for DECIMAL and complex types? (No need to add a Yes, added tests. Done Line 14: create table t primary key (id) distribute by hash (id) into 3 buckets > Add test for creating and querying a table that has Impala keywords as the Yeah you were right about using getIdentSql(). This doesn't work as expected. Fixed it and added a test. http://gerrit.cloudera.org:8080/#/c/4414/6/tests/common/kudu_test_suite.py File tests/common/kudu_test_suite.py: PS6, Line 47: super(KuduTestSuite, cls).setup_class() : if os.environ["KUDU_IS_SUPPORTED"] == "false": : pytest.skip("Kudu is not supported") > Maybe move L47 to L50 (switch the ordering of these)? This isn't just a nit Done PS6, Line 51: @classmethod : def teardown_class(cls): : pass > Not needed: Done http://gerrit.cloudera.org:8080/#/c/4414/6/tests/custom_cluster/test_kudu.py File tests/custom_cluster/test_kudu.py: PS6, Line 20: from kudu.client import INT32 > ImportError: I think this needs to be kudu.schema Done PS6, Line 34: @classmethod : def add_test_dimensions(cls): : super(CustomClusterTestSuite, cls).add_test_dimensions() : cls.TestMatrix.add_constraint(lambda v: : v.get_value('table_format').file_format == 'parquet' and : v.get_value('table_format').compression_codec == 'none') > I don't think this is needed: no tests here use a vector. Done PS6, Line 43: def test_kudu_master_hosts(self, cursor, kudu): > kudu_client, not kudu. Note this isn't just a rename; it's to make use of t Done PS6, Line 45: with self.temp_kudu_table(kudu, [INT32]) as kudu_table: > kudu_client Done PS6, Line 50: props) > SyntaxError: missing a closing paren here Done http://gerrit.cloudera.org:8080/#/c/4414/6/tests/query_test/test_kudu.py File tests/query_test/test_kudu.py: PS6, Line 27: from pprint import pprint > Remove: unused import Done PS6, Line 30: from copy import copy : : from tests.beeswax.impala_beeswax import ImpalaBeeswaxException : from tests.common.impala_test_suite import ImpalaTestSuite : from tests.common.skip import SkipIf : from tests.common.test_dimensions import create_uncompressed_text_dimension : from tests.common.test_vector import TestDimension > All imported but not used. Done PS6, Line 49: def test_kudu_scan_node(self, vector): > This test is racy in exhaustive: two tests will be competing for the same d Thanks. Done Line 57: def test_insert_update_delete(self, vector, unique_database): > Can we keep the .test file name and the python test function consistent? i. Done http://gerrit.cloudera.org:8080/#/c/4414/7/tests/query_test/test_kudu.py File tests/query_test/test_kudu.py: Line 192: 'show_create_sql' can be templates that can be used with str.format(). format() > In our other SHOW CREATE TABLE tests we also try to run execute the output See comment below for SPLIT ROWS. Unfortunately, the output is not always executable. To be fixed in Kudu. Line 204: CREATE TABLE {table} (c INT PRIMARY KEY) > add comment to the PK column That's interesting. Actually, currently don't work because we can't persist them in Kudu and the first time the schema is loaded from Kudu we override the HMS column defs, so the comments are lost. So, option 1 is to store them in tblproperies and option 2 to ignore and/or throw error if specified for Kudu tables. Thoughts? Line 213: TBLPROPERTIES ('kudu.master_addresses'='{kudu_addr}')""".format( > Why include these TBLPROPERTIES? Good question. If the user doesn't specify the kudu master_addresses we populate tblproperties with the known values. The issue is that we can't tell if the user added this property or it was generated. I chose to always display this property but I can change it if you don't like it. Line 226: DISTRIBUTE BY HASH (c) INTO 3 BUCKETS, RANGE (c) SPLIT ROWS (...) > is the SPLIT ROWS (...) legal syntax? No, but Kudu's API doesn't currently return any information about range partitions except the column names used. -- To view, visit http://gerrit.cloudera.org:8080/4414 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1 Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Dimitris Tsirogiannis <dtsirogian...@cloudera.com> Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com> Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com> Gerrit-Reviewer: Matthew Jacobs <m...@cloudera.com> Gerrit-Reviewer: Michael Brown <mi...@cloudera.com> Gerrit-HasComments: Yes