Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables

Patch Set 5:


Commit Message:

Line 45:    "KEY" is expected to be common enough that the ident style
> Might also want to mention that "key" is used for nested map types, so it's

Line 51:    on the HMS database and table name. If the database is "default",
> Does this last regarding "default" make the code easier or more complicated
Not sure why the special case. Removing it unless someone objects.

File be/src/service/frontend.cc:

Line 61: DEFINE_string(kudu_master_addrs, "", "Specifies the default Kudu 
master(s). The given "
> Can we make this more consistent with out existing options? For example:

File fe/src/main/java/org/apache/impala/analysis/ColumnDef.java:

Line 176:   static List<String> toColumnNames(Collection<ColumnDef> columnDefs) 
> colDefs

Line 184:   static Map<String, ColumnDef> 
mapByColumnNames(Collection<ColumnDef> colDefs) {
> comment that colDefs is assumed have no duplicate names, or alternatively d

File fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java:

Line 99:           "does not support (%s) file format. Supported formats are: 
> the (%s) file format.

Line 100:           createStmt_.getFileFormat().toString().replace("_", ""),
> what's with this weird "_" replacement?
I believe the intention was to make the file format name from THdfsFileFormat 
consistent with the file format name as specified in a "STORED AS" clause. 
RC_FILE is RCFILE, TEXT is TEXTFILE. So, as you can tell we aren't very 
consistent here. That's also the reason for hardcoding the values in L101. Left 
a TODO in the CatalogObjects.thrift for now if that's ok with you.

Line 101:           "PARQUET, TEXTFILE, KUDU"));
> use SUPPORTED_INSERT_FORMATS instead of hardcoding
See comment above.

File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java:

Line 80:   private void setColumnDefs(List<ColumnDef> columnDefs) {
> nit: colDefs everywhere

Line 95:   Map<String, String> getTblProperties() { return 
tableDef_.getTblProperties(); }
> is there any significance to not adding public/private to some of these met
Casey started using the default access modifier that essentially gives public 
access only to classes in the same package. In theory, we should be using the 
most restrictive access modifier, so I kind of like this. For sure the codebase 
is not consistent with using this rule, so if you think we shouldn't be doing 
this I'll make them public.

Line 192:           "Only Kudu tables can use DISTRIBUTE BY clause.");
> the DISTRIBUTE BY clause

File fe/src/main/java/org/apache/impala/analysis/TableDef.java:

Line 43:  * Represents the table parameters in a CREATE TABLE statement.
> Let's be more precise and list the clauses that this captures.

File fe/src/main/java/org/apache/impala/catalog/KuduTable.java:

Line 97:   // Table name in Kudu storage engine. It may not neccessarily be the 
same as the
> the Kudu storage engine

Line 101:   // 1. For managed tables, 'kuduTableName_' is prefixed with 
'__IMPALA__<db_name>' to
> Can we make the prefix 'impala::<db_name>' instead?
Sure no prob. Done

Line 154:    * are loaded from HMS. The function also updates the table schema 
in HMS.
> Add a comment about why we also update the HMS. My understanding is that we

Line 195:       msClient.alter_table(msTable_.getDbName(), 
msTable_.getTableName(), msTable_);
> This means every invalidate/refresh will do an extra alter to HMS. Are we c
Good point. We can potentially avoid this by checking if there has been some 
change in the schema. At this point, I am not so terribly worried about this. 
I'll keep a note to actually profile it and see if it's worth optimizing that 

Line 202:    * Loads the schema from the Kudu table including column 
definitions and primary key
> mention that it loads the schema into this table as well as the HMS table

Line 216:       String colName = ToSqlUtils.getIdentSql(colSchema.getName());
> getIdentSql() may add backticks, I don't think we want that here
Hm, can you explain why this is the case given that we will use this name to 
populate the columns in the HMS table?

Line 223:     numClusteringCols_ = primaryKeyColumnNames_.size();
> Check that there is at least one PK column
Actually, I removed this and left a TODO. There are a few places that interpret 
numClusteringCols_ in a way that is not consistent with Kudu tables. For 
example, the analysis in InsertStmt is using this value to determine if there 
is a result expr in the associated select stmt for every non-partitioned and 
partitioned column of the target table.

Line 228:     PartitionSchema partitionSchema = kuduTable.getPartitionSchema();
> Preconditions.checkState(!colsByPos_.isEmpty()); ?

Line 254:   public static KuduTable createTmpTable(Db db,
> createCtasTarget()? Seems more explicit since there really is only one use.

File fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java:

Line 1117:     if (db != null) dropTablesFromKudu(db);
> Won't this drop all Kudu tables even if CASCADE is not specified?
Oops. You're right. Done

Line 1151:     // the metadata must be fetched from the Hive Metastore.
> Why do we need to load from HMS? See comments below.
In order to drop a table from Kudu we need the Kudu table name which is stored 
in tblproperties. Let me know if it's clear or if I should leave a comment.

Line 1167:         throw new ImpalaRuntimeException(
> This seems like weird behavior. The user issued a cascading drop db, but ge
Makes sense. Changed it to log the error and continue.

Line 1172:       if (!KuduTable.isKuduTable(msTable)) continue;
> Why do we need this check? Won't Kudu tell us if the table does not exist? 
Well, it's an easy check we can do to avoid RPC calls to Kudu for dropping 
non-Kudu tables. msTables contains all the tables of a database.

Line 1438:     StorageDescriptor sd = HiveStorageDescriptorFactory.createSd(
> should we consolidate all the SD stuff into one createSd() call?

Line 1453:     if (params.getPartition_columns() != null) {
> for my understanding: these are empty for Kudu tables right?

File fe/src/main/java/org/apache/impala/service/KuduCatalogOpExecutor.java:

Line 41: //import org.apache.hadoop.hive.metastore.api.Table;
> ?

Line 53:  * such as creating and droping tables from Kudu.
> typo: dropping

Line 58:   public final static ColumnDef REQUIRED_THRIFT_COLUMN_DEF;
> Isn't it enough to set an empty list or mark the thrift field as set?
Yeah, I don't know why this was added in the first place. Removed.

Line 70:   static void createTable(org.apache.hadoop.hive.metastore.api.Table 
> createManagedTable()? and then mention that msTbl must be a managed table

Line 85:       kudu.createTable(kuduTableName, schema, tableOpts);
> May this throw if the table already exists? If so it might be cleaner to re
I see what you're saying but I don't think this will work. We need to be able 
to handle the IF NOT EXISTS clause and there doesn't seem to exist a subclass 
of KuduException to indicate that createTable() failed because the table 
already exists. So, I can't find a clean way to differentiate between the error 
cases. What's needed here is Kudu API to take IF NOT EXISTS as parameter.

Line 97:       throws ImpalaRuntimeException {
> remove throws?
Hm, fromImpalaType throws so I don't think we can remove it here.

Line 121:       TCreateTableParams params, Schema schema) throws 
ImpalaRuntimeException {
> remove throws?
Same here, KuduUtil.parseSplits throws ImpalaRuntimeException.

Line 155:       int r = Integer.parseInt(replication);
> catch NumberFormatException?

Line 170:     if (Table.isExternalTable(msTbl)) return;
> Can we make this a Precondition and have the caller check it?

Line 178:       if (kudu.tableExists(tableName)) {
> same comment about non-atomic exists() + drop() calls
I get it. See my comment above and let me know what you think.

Line 214:         String colName = ToSqlUtils.getIdentSql(colSchema.getName());
> I don't think we want getIdentSql() here
I'd appreciate it if you could explain why is it wrong to use this here and/or 
what is the desired behavior in terms of column name.

Line 239:     } catch (Exception e) {
> I think we should be more nuanced in our exception handling here, in partic
Good point. Done

File fe/src/main/java/org/apache/impala/util/KuduUtil.java:

Line 198:         metastoreTableName : "__IMPALA__" + metastoreDbName + "." + 
> make the prefix a static constant (and change it to "impalad::" please)

Line 240:       case BOOL: return Type.BOOLEAN;
noop :)

To view, visit http://gerrit.cloudera.org:8080/4414
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dtsirogian...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <m...@cloudera.com>
Gerrit-Reviewer: Michael Brown <mi...@cloudera.com>
Gerrit-HasComments: Yes

Reply via email to