Matthew Jacobs has posted comments on this change. Change subject: IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables ......................................................................
Patch Set 1: (28 comments) Ok, here's a bit to get started. Just made my way through most of analysis. Still have a lot to get through but figured I'll give you the feedback I have while I do a few other things. http://gerrit.cloudera.org:8080/#/c/4414/1//COMMIT_MSG Commit Message: PS1, Line 21: When attempting to create an external : table "foo" in database "bar", Impala will search for a Kudu table : name "foo.bar" and "bar" (Kudu doesn't have database name spaces : yet.) I think this sentence is duplicated by #10. Maybe remove this part. This item is about reading the columns, right? The issue of table name resolution is separate IMO, and covered by #10, though consider moving them next together :) PS1, Line 25: The Kudu table is now required to exist at the time of creation in : Impala. for external tables? or if this is more general can you explain a bit more? PS1, Line 31: default value can it still be overridden? PS1, Line 39: Eventually Hive should have the needed : functionality and the Kudu delegate (renamed in this patch to KuduCatalogOpExecutor) : can be removed. maybe, it's not clear if impala would want to keep the ability to do this on its own... Remove? PS1, Line 54: If the database is "default" then : the Kudu table name will not include the database. hm... I wonder if it'd be easier just to keep it standardized and always have the name, i.e. using default in this case. What do you think? I'll see if I change my mind after reading more code. PS1, Line 56: 11) Several improvements in the grammar related to the family : of CREATE TABLE statements. I don't think this bullet adds much value to the commit message. Everything else is more user visible, this is just some necessary code cleanup. PS1, Line 58: 12) Added new tests and modified existing Kudu test to use the new : CREATE TABLE syntax. I think this can be removed too, it should be implied there are tests and it takes away from the billion other "features" in this patch :) PS1, Line 60: source of truth Does this mean we always hit Kudu for metadata? In the Catalog? In the impalad's Catalog? PS1, Line 60: insteads instead PS1, Line 64: Not included in this commit: : - Additional column properties such as nullability and compression : encodings. I don't think you need to include this, we have a separate JIRA for this. http://gerrit.cloudera.org:8080/#/c/4414/1/fe/src/main/cup/sql-parser.cup File fe/src/main/cup/sql-parser.cup: FYI I'm not looking at this very carefully because it was reviewed previously https://gerrit.cloudera.org/#/c/2865/ Let me know if this has changed much since then. PS1, Line 405: view_column_def_list, view_column_defs; separate nonterminals if they dont fit? PS1, Line 458: key_ident; does this not need Boolean like those below? http://gerrit.cloudera.org:8080/#/c/4414/1/fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java File fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java: I'm not in love with the TableDef and TableDefOptions stuff because it feels like that's parser implementation concerns bleeding into analysis code. Do you think we need both? Can we fold the latter into TableDef? If we have to keep both, it'd be helpful to at least group the functions in this file based on their source (e.g. TableDef, TDOptions, or on this itself) with comments around the groups? Maybe within those groups they could also be sorted by visibility? There's a lot of stuff going on so I think it'd be helpful to keep it tidy. PS1, Line 240: getColumnDefs().add(KuduCatalogOpExecutor.REQUIRED_THRIFT_COLUMN_DEF); is this needed? I thought there was going to be code that fetches the columns from Kudu and stores them. PS1, Line 242: if (!getTblProperties().containsKey(KuduTable.KEY_TABLE_NAME)) { : getTblProperties().put(KuduTable.KEY_TABLE_NAME, : KuduUtil.getDefaultCreateKuduTableName(getDb(), getTbl())); : } hm... can we simplify this and only allow internal tables to be created with the name db_name.table_name? Obviously there could be conflicts and it should fail then. Then it'd be 1 less case to consider. Also, I wonder if we should prefix the underlying kudu table names with something, e.g. IMPALA_db.table? PS1, Line 275: // Kudu's data distribution rules are enforced here. A good reference is : // http://getkudu.io/docs/schema_design.html#data-distribution. As a summary: : // * Only primary key column can be used in distribution definitions : // * Distributions may be either hash or ranged (value intervals) : // * Multiple distributions can be used/mixed : // * A column can only be used in one hash distribution : // * A range distribution can only be used once and must be the last definition : // (both enforced by the grammar). I don't think we should get in the business of validating the Kudu rules since they should reject it. Is there a reason we have to do this ourselves as well? PS1, Line 323: throwIfPrimaryKeysWereSpecified I think the callers can throw, simpler to have a fn to check: bool hasPrimaryKeysSpecified(). Or maybe this isn't necessary? see my commetn below. PS1, Line 327: isPrimaryKey when will we have PKs from the col defs but not in the tableDef_.getPrimaryKeyColumnNames()? Better yet, it looks like the 2 callers of this fn have slightly different use cases so maybe it can be scrapped and handled more specifically. Feel free to disagree... PS1, Line 327: if (colDef.isPrimaryKey()) { : hasKey = true; : } 1line PS1, Line 335: throwIfNotPrimaryKeyType I only see this called once, please inline it http://gerrit.cloudera.org:8080/#/c/4414/1/fe/src/main/java/com/cloudera/impala/analysis/TableDefOptions.java File fe/src/main/java/com/cloudera/impala/analysis/TableDefOptions.java: PS1, Line 32: Represents the end of a CREATE TABLE statement. TableDef represents the beginning of : * the statement. I find this odd. The 'beginning/end of the statement' feels like a parser implementation detail. I'm not sure why this needs to be a separate class. TableDef seems OK, and it already has some things that get set after it is constructed (e.g. partitions), so maybe these things can be moved to TableDef and set via addl functions. http://gerrit.cloudera.org:8080/#/c/4414/1/fe/src/main/java/com/cloudera/impala/analysis/ToSqlUtils.java File fe/src/main/java/com/cloudera/impala/analysis/ToSqlUtils.java: PS1, Line 207: HdfsFileFormat.KUDU blegh this looks like an oxymoron. I wonder if we can rename this class to something like StorageFormat. I'll keep an eye out as I read more code. PS1, Line 212: keyCols != null why would this be null? PS1, Line 217: kuduTableName != null why would this be null? PS1, Line 217: if (kuduTableName != null : && kuduTableName.equals(KuduUtil.getDefaultCreateKuduTableName( : table.getDb().getName(), table.getName()))) { : properties.remove(KuduTable.KEY_TABLE_NAME); : } might just be cleaner to leave this in and not have to have 2 cases to test http://gerrit.cloudera.org:8080/#/c/4414/1/infra/python/deps/requirements.txt File infra/python/deps/requirements.txt: PS1, Line 67: # kudu-python==0.2.0 When this changes I think there may be some steps involved in "upgrading" the environment- have you figured out what that is? We'll need to tell people what command(s) to run after fetching this change. PS1, Line 70: # These should eventually be removed https://issues.apache.org/jira/browse/KUDU-1456 : unittest2 == 1.1.0 : linecache2 == 1.0.0 : traceback2 == 1.4.0 remove -- To view, visit http://gerrit.cloudera.org:8080/4414 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Dimitris Tsirogiannis <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Dimitris Tsirogiannis <[email protected]> Gerrit-Reviewer: Matthew Jacobs <[email protected]> Gerrit-HasComments: Yes
