Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables

Patch Set 1:


Commit Message:

PS1, Line 21: When attempting to create an external
            :   table "foo" in database "bar", Impala will search for a Kudu 
            :   name "foo.bar" and "bar" (Kudu doesn't have database name spaces
            :   yet.)
> I think this sentence is duplicated by #10. Maybe remove this part. This it

PS1, Line 25:  The Kudu table is now required to exist at the time of creation 
            :   Impala.
> for external tables? or if this is more general can you explain a bit more?

PS1, Line 31: default value
> can it still be overridden?

PS1, Line 39: Eventually Hive should have the needed
            :   functionality and the Kudu delegate (renamed in this patch to 
            :   can be removed.
> maybe, it's not clear if impala would want to keep the ability to do this o

PS1, Line 54:  If the database is "default" then
            :   the Kudu table name will not include the database.
> hm... I wonder if it'd be easier just to keep it standardized and always ha

PS1, Line 56: 11) Several improvements in the grammar related to the family
            :   of CREATE TABLE statements.
> I don't think this bullet adds much value to the commit message. Everything

PS1, Line 58: 12) Added new tests and modified existing Kudu test to use the new
            :   CREATE TABLE syntax.
> I think this can be removed too, it should be implied there are tests and i

PS1, Line 60: insteads
> instead

PS1, Line 60: source of truth
> Does this mean we always hit Kudu for metadata? In the Catalog? In the impa

PS1, Line 64: Not included in this commit:
            : - Additional column properties such as nullability and compression
            : encodings.
> I don't think you need to include this, we have a separate JIRA for this.

File be/src/catalog/catalog.cc:

PS1, Line 47:     {"<init>", "(ZILjava/lang/String;IIZLjava/lang/String;)V",
            :       &catalog_ctor_},
> 1 line

File be/src/service/frontend.cc:

PS1, Line 37: // XXX: This flag doesn't seem to be used anywhere. Maybe remove 
            : DEFINE_bool(load_catalog_at_startup, false, "if true, load all 
catalog data at startup");
> agreed. doesn't look like it's used by JniFrontend. Let's remove it.

PS1, Line 63: IPs
> IP addresses.

PS1, Line 63:  
> no space

File common/thrift/CatalogObjects.thrift:

PS1, Line 52: THdfsFileFormat
> Maybe not for this patch since it's already huge, but it'd be great to gene
I agree we need to revisit this. Left a TODO for now.

PS1, Line 61: THdfsCompression
> similarly, this seems unnecessarily specific to Hdfs. Not necessarily somet
Left TODO.

File fe/src/main/cup/sql-parser.cup:

> FYI I'm not looking at this very carefully because it was reviewed previous
Only few things changed mostly wrt specifying the PRIMARY KEY keyword  inside 
the column definition list.

PS1, Line 31: ColumnDefOptions
> this class doesn't exist, please add the missing file

PS1, Line 405:   view_column_def_list, view_column_defs;
> separate nonterminals if they dont fit?

PS1, Line 458: key_ident;
> does this not need Boolean like those below?

File fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java:

> I'm not in love with the TableDef and TableDefOptions stuff because it feel
As we spoke offline, the initial attempt to refactor this logic didn't succeed. 
I will not change these classes but I'll make an effort to separate the 
Kudu-specific logic at the function level.

PS1, Line 240: 
> is this needed? I thought there was going to be code that fetches the colum
We do, but that happens in the catalog. This was added to satisfy the thrift 

PS1, Line 242:       if 
(!getTblProperties().containsKey(KuduTable.KEY_TABLE_NAME)) {
             :         getTblProperties().put(KuduTable.KEY_TABLE_NAME,
             :             KuduUtil.getDefaultCreateKuduTableName(getDb(), 
             :       }
> hm... can we simplify this and only allow internal tables to be created wit
This is executed only for managed tables. Is this what you meant? Also, I 
changed the prefix to be __IMPALA__db.table;

PS1, Line 275:         // Kudu's data distribution rules are enforced here. A 
good reference is
             :         // 
http://getkudu.io/docs/schema_design.html#data-distribution. As a summary:
             :         // * Only primary key column can be used in distribution 
             :         // * Distributions may be either hash or ranged (value 
             :         // * Multiple distributions can be used/mixed
             :         //   * A column can only be used in one hash distribution
             :         //   * A range distribution can only be used once and 
must be the last definition
             :         //     (both enforced by the grammar).
> I don't think we should get in the business of validating the Kudu rules si
Actually, I don't find it such a terrible idea. The sooner we figure out 
invalid params the better it is for the entire system, i.e. we don't have to 
make an rpc to the catalog to make an rpc to Kudu in order to figure out that a 
non-primary key column was used in a distribution. Thoughts?

PS1, Line 323: throwIfPrimaryKeysWereSpecified
> I think the callers can throw, simpler to have a fn to check: bool hasPrima
I responded below. If you don't like it we can change to check + throw in the 
caller. I don't mind either options.

PS1, Line 327: isPrimaryKey
> when will we have PKs from the col defs but not in the tableDef_.getPrimary
We can have PKs from the col defs in the following case:
CREATE TABLE foo (a int PRIMARY KEY)... Even though the callers of this 
function correspond to different use cases (external vs non-kudu) the check is 
similar, i.e. that no PK was specified anywhere in the CREATE TABLE stmt. I am 
ok with this but if you don't like it I can change it.

PS1, Line 327:         if (colDef.isPrimaryKey()) {
             :           hasKey = true;
             :         }
> 1line

PS1, Line 335: throwIfNotPrimaryKeyType
> I only see this called once, please inline it

File fe/src/main/java/com/cloudera/impala/analysis/DistributeParam.java:

PS1, Line 84: BigDecimal
> why BigDecimal? Ultimately this has to resolve to some int for kudu's API. 
I believe because that's what the Parser generates. I changed it to int.

PS1, Line 110: <= 1)
> is 1 bucket actually not valid?
Yes, I verified it using the Kudu client.

PS1, Line 138: colType.isStringType() && !exprType.isStringType()
             :               || colType.isIntegerType() && 
             :                   || exprType.getPrecision() > 
> 1. I don't see anything in the Kudu client that explicitly says you can't p
There is a restriction on the types of primary keys (no bool, float or double) 
which I verified using the Kudu client; it is also mentioned in Kudu docs (see 
http://kudu.apache.org/docs/schema_design.html#primary-keys). Consequently, you 
can't partition on these types of columns.

In any case, these checks here are for ensuring that the types of split value 
literals can be mapped to the types of the partition columns. Do you think we 
should change the way we validate this?

File fe/src/main/java/com/cloudera/impala/analysis/TableDefOptions.java:

PS1, Line 32: Represents the end of a CREATE TABLE statement. TableDef 
represents the beginning of
            :  * the statement.
> I find this odd. The 'beginning/end of the statement' feels like a parser i
I merged these two classes into TableDef and updated the parser. Let me know 
what you think.

File fe/src/main/java/com/cloudera/impala/analysis/ToSqlUtils.java:

PS1, Line 207: HdfsFileFormat.KUDU
> blegh this looks like an oxymoron. I wonder if we can rename this class to 
Yeah, I agree. I left a TODO in the thrift file to change it.

PS1, Line 212: keyCols != null
> why would this be null?

PS1, Line 217: kuduTableName != null
> why would this be null?

PS1, Line 217:       if (kuduTableName != null
             :           && 
             :               table.getDb().getName(), table.getName()))) {
             :         properties.remove(KuduTable.KEY_TABLE_NAME);
             :       }
> might just be cleaner to leave this in and not have to have 2 cases to test

PS1, Line 223: Preconditions.checkNotNull(kuduDistributeByParams);
> This breaks for tables that were created before this change.
Good point. Removed.

File fe/src/main/java/com/cloudera/impala/catalog/HdfsFileFormat.java:

PS1, Line 39: HdfsFileFormat
> Can you open a JIRA (and leave it in the comment) to refactor this later?

File fe/src/main/java/com/cloudera/impala/catalog/KuduTable.java:

PS1, Line 222: 
> was this moved somewhere else? Maybe I'll find it as I keep reading...
Added necessary checks in the load() function.

PS1, Line 79: 
            :   // TODO we should have something like 
> do you know what this means / can it be removed now that we're adding defau
Not clear to me. I removed it.

PS1, Line 87: 
            :   // Key to specify the number of tablet replicas.
            :   // TODO(KUDU): Allow modification in alter table.
            :   public static final String KEY_TABLET_REPLICAS = 
> After looking a bit more, I see that this is the way the user specifies the
Yeah, as you pointed out, it would be weird from a usability point of view to 
strip something that the user specified in tblProperties. I removed stuff that 
should be in tblProperties such as the primary key columns. Also removed the 
TODO. Wrt CREATE TABLE LIKE semantics, let's talk offline.

PS1, Line 87: 
            :   // Key to specify the number of tablet replicas.
            :   // TODO(KUDU): Allow modification in alter table.
            :   public static final String KEY_TABLET_REPLICAS = 
> I don't think we need to store this, even if we wanted to modify this later

Line 92:   public static final String KEY_DISTRIBUTE_BY = "kudu.distribute_by";
> comment

PS1, Line 92:   public static final String KEY_DISTRIBUTE_BY = 
> I'm not sure we should be storing this. The Kudu client exposes KuduTable.g
Agree. I wasn't aware of the getPartitionSchema() API. I changed the code to 
use that instead.

Line 153:    * Load the columns from the schema list
> can you add a comment about error handling in this function?

PS1, Line 158:       LOG.error(String.format("Kudu tables must have at least 
             :           + "key column (had %d), and no more key columns than 
there are table columns "
             :           + "(had %d).", keyColumns.size(), schema.size()));
> shouldn't this still fail? if not, can you add a comment why this continues

PS1, Line 184:       LOG.error(String.format("Some key columns were not found 
             :               + " the set of columns. List of column names: %s, 
List of key column names:"
             :               + " %s", Iterables.toString(columnNames), 
> why do we continue?

PS1, Line 199:     // Get the table metadata from Kudu
             :     if (reuseMetadata) {
> I'm confused about this. It's not clear to me from the name 'reuseMetadata'

PS1, Line 238: // Update the HMS
             :     if (reuseMetadata) {
             :       try {
             :         client.alter_table(msTable_.getDbName(), 
msTable_.getTableName(), msTable_);
             :       } catch (TException e) {
             :         throw new TableLoadingException(e.getMessage());
             :       }
             :     }
> Same deal, I'm not sure why reuseMetadata means update the HMS.

File fe/src/main/java/com/cloudera/impala/service/CatalogOpExecutor.java:

PS1, Line 220: 
             :   private final static KuduCatalogOpExecutor kuduExecutor_
             :       = new KuduCatalogOpExecutor();
> all static methods, why instantiate it?
Correct. Removed.

PS1, Line 1122:  // The db == null case isn't handled. The only tables this 
should matter for are
              :     // Kudu tables. The expectation is Impala will always know 
about any Kudu tables
              :     // because Hive doesn't support Kudu yet. 
> I've always had a problem with this comment :/ even after trimming it down 

PS1, Line 1124: When Hive supports Kudu the DDL delegates
              :     // can be removed. 
https://issues.cloudera.org/browse/IMPALA-3424 tracks the removal.
> Can you remove these sentences and just leave IMPALA-3424 as a reference? W

PS1, Line 1127: ,
> nit: extra comma

PS1, Line 1150:         // The operation will be aborted if the Kudu table 
cannot be dropped. If for
              :         // some reason Kudu is permanently stuck in a 
non-functional state, the user is
              :         // expected to ALTER TABLE to either set the table to 
UNMANAGED or set the format
              :         // to something else.
> JIRA? Even if we don't fix it we should have something public to point to r
Not sure what the JIRA should say in this case. I'll look into the tests and 
see if I can create a more specific scenario which can be documented.

PS1, Line 1205: kuduExecutor_
> static ref

PS1, Line 1406: if (KuduTable.isKuduTable(tbl)) {
              :       return createKuduTable(tbl, params.if_not_exists, 
              :           response);
              :     }
              :     return createTable(tbl, params.if_not_exists, 
params.getCache_op(), response);
              :   }
> it's too bad we have to just branch like this and handle everything differe

PS1, Line 1420:  createMetaStoreTable
> I'd prefer to have this on the prev line and the parameter on the new line

Line 1475:    * creation of a managed Kudu table.
> comment on response param

PS1, Line 1480: TDdlExecResponse response)
> I think this just fits on the prev line

PS1, Line 1483: kuduExecutor_
> static reference KuduCatalogOpExecutor, here elsewhere in this fn

PS1, Line 1509:       // Add the table to the catalog cache
              :       Table newTbl = catalog_.addTable(newTable.getDbName(), 
              :       addTableToCatalogUpdate(newTbl, response.result);
> While I don't think these will throw, it might be worth wrapping all the lo
Let me think about this a bit more. In theory, in this case the user should be 
able to recover by doing a REFRESH on the table.

File fe/src/main/java/com/cloudera/impala/service/Frontend.java:

PS1, Line 230: impaladCatalog_.getDefaultKuduMasterAddrs()
> this is fine but a little weird, normally I'm all about removing extra stat

File fe/src/main/java/com/cloudera/impala/service/JniCatalog.java:

PS1, Line 83: 
            :       int otherLogLevel, boolean allowAuthToLocal, String 
> wrapping

File fe/src/main/java/com/cloudera/impala/service/KuduCatalogOpExecutor.java:

PS1, Line 29: import org.apache.hadoop.hive.metastore.api.T
> since this file references our Table class we should import ours and refere

PS1, Line 45: yet
> remove

PS1, Line 45: When Hive
            :  * functionality is available, that should be preferred to the 
functionality here.
            :  * https://issues.cloudera.org/browse/IMPALA-3424 tracks this.
> remove text since we might not. JIRA can stay but is actually a different i

PS1, Line 65: com.cloudera.impala.catalog.Table.
> shouldn't need to ref the full namespace

Line 149:    * Reads the schema from a Kudu table and populates 'msTbl' with an 
equivalent schema.
> If an error occurs we may have partially modified the table and leave it in

PS1, Line 150: if unable to do so.
> if any errors are encountered.

PS1, Line 155:       // Schemas for external tables are not specified by the 
user in DDL. Instead the
             :       // user provides a Kudu table name (or uses implicit Hive 
Metastore to Kudu mapping
             :       // rules), then the schema is imported from Kudu.
> I'm not sure if this is adding much and I think it's a bit confusing/out of

PS1, Line 162: // Start searching....
> remove

PS1, Line 166: if (!Strings.isNullOrEmpty(dbName) && 
!dbName.equalsIgnoreCase("default")) {
             :           // If it exists, the fully qualified name is preferred.
             :           candidateTableNames.add(dbName + "." + tableName);
             :         }
> i don't think we should bother handling the default case specially. Can we 

File fe/src/main/java/com/cloudera/impala/util/KuduClient.java:

PS1, Line 30:  * This class wraps an org.apache.kudu.client.KuduClient to 
transform exceptions into
            :  * ImpalaRuntimeExceptions. No additional functionality is 
provided. See the Kudu
            :  * documentation for information about the methods.
> Let's think about whether or not we can avoid this. Obviously this will hav
Yeah, not a big fan of this either. Let's talk about it.

File fe/src/main/java/com/cloudera/impala/util/KuduUtil.java:

PS1, Line 41: org.apache.kudu.
> reference the ext Type and import ours

PS1, Line 191: isSupportedKeyType
> is this their restriction?
yes. Done

PS1, Line 191: com.cloudera.impala.catalog.
> import?

PS1, Line 196:  and the user did
             :    * not provide an explicit Kudu table name
> ... a table, assuming a custom name was not provided.

PS1, Line 201: Catalog.isDefaultDb(metastoreDbName) ?
> you know i'm voting for not special casing this...
I've changed the default behavior for external tables. For managed it may make 
sense to generate a default name. No strong opinion though. If you feel it's 
too confusing, we can change it.

File fe/src/test/java/com/cloudera/impala/analysis/AnalyzeDDLTest.java:

PS1, Line 28: import com.cloudera.impala.catalog.KuduTable;
            : import junit.framework.Assert;
> nit: organize imports

File infra/python/deps/requirements.txt:

PS1, Line 67: # kudu-python==0.2.0
> When this changes I think there may be some steps involved in "upgrading" t
I didn't have to do anything. bin/boostrap_toolchain.py checks the versions and 
downloads/builds the new version automatically.

PS1, Line 70:   # These should eventually be removed  
            :   unittest2 == 1.1.0
            :     linecache2 == 1.0.0
            :     traceback2 == 1.4.0
> remove

File testdata/bin/generate-schema-statements.py:

PS1, Line 196: primay
> primary

File testdata/workloads/functional-query/queries/QueryTest/create_kudu.test:

> Also, I don't see any tests for CTAS here or anywhere.
Most of these have been moved elsewhere. I'll add CTAS tests.

> we might still need to have explicit create table tests.
I agree. We do lack integration tests with Kudu. I'll add them in a followup 


> well, we should have some coverage of show create table...
We do in the test_kudu.py (TestShowCreateTable class)


PS1, Line 15: 
> I think this kind of error should still be covered. You can still mismatch 
Covered in analysis tests.

PS1, Line 37: 
> we should still cover this as well.
Covered in analysis tests.

PS1, Line 50: 
> same
Covered in analysis tests.

> Some more questions about losing coverage. Fine if this is now covered by F
Yes, all these cases are covered in the AnalyzeDDLTest.java (analysis tests).

File tests/common/__init__.py:

> does this need to be its own file?
Maybe a more python-savvy person can argue if this is best practice. I assume 
__init__.py  is where you put any initialization code/global vars.

File tests/common/kudu_test_suite.py:

PS1, Line 54:   @classmethod
            :   def get_db_name(cls):
            :     # When py.test runs with the xdist plugin, several processes 
are started and each
            :     # process runs some partition of the tests. It's possible 
that multiple processes
            :     # will call this method. A random value is generated so the 
processes won't try
            :     # to use the same database at the same time. The value is 
cached so within a single
            :     # process the same database name is always used for the 
class. This doesn't need to
            :     # be thread-safe since multi-threading is never used.
            :     if not cls.__DB_NAME:
            :       cls.__DB_NAME = \
            :           choice(ascii_lowercase) + 
"".join(sample(ascii_lowercase + digits, 5))
            :     return cls.__DB_NAME
> I've always disliked this function... I think we should try to use the db f
Duno. I'll check with Michael.

PS1, Line 127:     mapping = {BOOL: "BOOLEAN",
             :         DOUBLE: "DOUBLE",
             :         FLOAT: "FLOAT",
             :         INT16: "SMALLINT",
             :         INT32: "INT",
             :         INT64: "BIGINT",
             :         INT8: "TINYINT",
             :         STRING: "STRING"}
> this gets defined as a local every function invocation, right? how about cr
Hm, well it's only defined once and it is used only in the context of this fn. 
If you insist I'll move it outside of this fn.

File tests/custom_cluster/test_kudu.py:

PS1, Line 25: TestRedaction
> ?
:) Done

PS1, Line 42: temp_kudu_table
> I'd think we could have the test fn take the database fixture and then pass
I'll check with Michael to see what the issue with get _db_name() and if it is 
ok to replace it with db fixture, I'll change it here.

To view, visit http://gerrit.cloudera.org:8080/4414
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dtsirogian...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <m...@cloudera.com>
Gerrit-HasComments: Yes

Reply via email to