David Ribeiro Alves has posted comments on this change. Change subject: Review Only: Kudu Impala integration ......................................................................
Patch Set 6: (94 comments) http://gerrit.cloudera.org:8080/#/c/1403/6/be/src/exec/kudu-scan-node.h File be/src/exec/kudu-scan-node.h: Line 124: /// Protects access to state accessed by scanner threads, such as 'status_' or 'done_'. > if done_ is protected by the lock, then why is it volatile? done_ is actually sometimes not accessed within the lock (mostly for reads). this is legacy from nong's original impl, but I think it makes sense. Line 128: /// occurs in a scanner. > Please say: "Protected by lock_" here and any other members that are protec added thsi here, on num_active_scanners_ and on max_materialized_row_batches_ http://gerrit.cloudera.org:8080/#/c/1403/6/be/src/exec/kudu-util.cc File be/src/exec/kudu-util.cc: Line 112: to_lower_copy(schema.Column(i).name()) > this one seemed to get lost -- should use lower_case_col_name variable rath Done http://gerrit.cloudera.org:8080/#/c/1403/6/be/src/exec/kudu-util.h File be/src/exec/kudu-util.h: Line 42: IdxByLowerCaseColName* map); > this was lost too -- fix indentation. Done http://gerrit.cloudera.org:8080/#/c/1403/6/common/thrift/CatalogObjects.thrift File common/thrift/CatalogObjects.thrift: Line 185: // Indicates whether this is a Kudu column, If true implies all following Kudu specific > "column. " Done http://gerrit.cloudera.org:8080/#/c/1403/6/common/thrift/ImpalaInternalService.thrift File common/thrift/ImpalaInternalService.thrift: Line 193: // The timestamp on which to perform the scan for the query. > rephrase, the query might have multiple scan nodes. "The timestamp as of wh Done Line 198: 11: optional i64 timestamp = -1; > "timestamp" is too generic (timestamp of what?) changed this to snapshot timestamp http://gerrit.cloudera.org:8080/#/c/1403/6/common/thrift/JniCatalog.thrift File common/thrift/JniCatalog.thrift: Line 164: 2: required i32 buckets > num_buckets Done Line 187: 1: required TDistributeType type > type is redundant (by_has_paramn.__isset implies hash, for instance) Done Line 433: // If set, the table is automatically distributed according to this parameter > ". Kudu-only." Done http://gerrit.cloudera.org:8080/#/c/1403/6/fe/src/main/cup/sql-parser.cup File fe/src/main/cup/sql-parser.cup: Line 887: // follow the PARTITION feature of insert from select statements. > don't drop this comment Done Line 664: opt_with_clause:w KW_INSERT KW_OVERWRITE opt_ignore:ignore > i'm not in love with the 'ignore' syntax, it's cryptic (what's being ignore ok, created IMPALA-3145 Line 665: opt_kw_table table_name:table > move to previous line (and generally pack the lines with the syntactic elem Done Line 970: distribute_param_list:distribute tbl_properties:tbl_props > save one line (and apply elsewhere, there are lots of instances like that) Done Line 990: // columns will be added by the query statement during analysis. > update comment; you're passing in partition_cols a few lines downn Done Line 997: > remove blank line Done Line 1077: {: RESULT = DistributeParam.createHashParam(DistributeParam.Type.HASH, cols, buckets); :} > long line (and elsewhere) Done http://gerrit.cloudera.org:8080/#/c/1403/6/fe/src/main/java/com/cloudera/impala/analysis/AlterTableStmt.java File fe/src/main/java/com/cloudera/impala/analysis/AlterTableStmt.java: Line 70: > remove blank line (generally, try to emulate the prevailing formatting) Done Line 75: > same here Done http://gerrit.cloudera.org:8080/#/c/1403/6/fe/src/main/java/com/cloudera/impala/analysis/BinaryPredicate.java File fe/src/main/java/com/cloudera/impala/analysis/BinaryPredicate.java: Line 116: * predicate into literals, if possible. > add "todo: create a more general mechanism and retire this function" Done Line 118: public static BinaryPredicate normalizeAndFoldConstants(BinaryPredicate predicate, > rename to normalizeSlotRefComparison, that makes it clearer what the normal Done http://gerrit.cloudera.org:8080/#/c/1403/6/fe/src/main/java/com/cloudera/impala/analysis/CreateTableLikeFileStmt.java File fe/src/main/java/com/cloudera/impala/analysis/CreateTableLikeFileStmt.java: Line 68: fileFormat, location, cachingOp, ifNotExists, tblProperties, serdeProperties, null); > long line Done http://gerrit.cloudera.org:8080/#/c/1403/6/fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java File fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java: Line 96: this.columnDefs_ = Lists.newArrayList(columnDefs); > remove "this." everywhere Done Line 190: for(DistributeParam d : distributeParams_) { > formatting Done Line 355: if (!KuduTable.tableParamsAreValid(getTblProperties())) { > have this function return a more specific error string left a TODO will mark for a follow up. Line 371: for(DistributeParam d : distributeParams_) { > formatting Done http://gerrit.cloudera.org:8080/#/c/1403/6/fe/src/main/java/com/cloudera/impala/analysis/DistributeParam.java File fe/src/main/java/com/cloudera/impala/analysis/DistributeParam.java: Line 50: public static DistributeParam createHashParam(Type t, List<String> cols, > the type is redundant, remove Done Line 58: public static DistributeParam createRangeParam(Type t, List<String> cols, > same here Done Line 77: private final int buckets_; > num_buckets_ Done Line 82: private TDistributeByRangeParam rangeParam_; > comment: set in analyze() Done Line 100: public void analyze(Analyzer analyzer) throws AnalysisException { > this doesn't analyze columns_ (i'm assuming there are restrictions on the d for this and for the relevant comments below, which require a refactor of this method's logic to include the analysis of 'columns_', I would prefer to file a jira/leave a TODO including them and perform the refactor by itself after the merge. Line 106: // Creating the thrift structure here allows to throw meaningful exceptions. > i think what this is trying to say is that by creating the thrift structure Done Line 115: "of projected key columns: %d != %d", splitRow.size(), columns_.size())); > if someone lists a lot of split rows, it can be hard to find the offending Done Line 122: if (num.getType().isDecimal() || num.getType().isFloatingPointType()) { > this should happen during the analysis of columns_. here you should only ch again, postponing 'columns_' analysis post-merge, left a TODO Line 135: throw new AnalysisException(String.format("Split row value: %s (Type: %s) is not " > long line. how about "Split row value is not supported: %s (Type %s)" (that Done Line 161: return String.format("HASH(%s) INTO %d BUCKETS", Joiner.on(", ").join(columns_), buckets_); > long line. please fix those everywhere before resubmitting. Done http://gerrit.cloudera.org:8080/#/c/1403/6/fe/src/main/java/com/cloudera/impala/analysis/FromClause.java File fe/src/main/java/com/cloudera/impala/analysis/FromClause.java: Line 73: // All tableRefs have been analyzed, but at least one table was found missing. > was found missing -> is missing metadata Done Line 74: // There is no reason to proceed with analysis past this point. > remove this sentence, it only made sense in the context of SelectStmt Done http://gerrit.cloudera.org:8080/#/c/1403/6/fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java File fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java: Line 665: public DataSink createDataSink() { > optionally rename to createTableSink and return a TableSink the correct way to to this seems like it would be to move the createdatasink logic into tablesink (since it only creates tablesinks). don't want to go into refactors of existing code in a merge patch leaving a TODO Line 668: return DataSink.createDataSink(table_, partitionKeyExprs_, overwrite_, ignoreDuplicates_); > long line Done http://gerrit.cloudera.org:8080/#/c/1403/6/fe/src/main/java/com/cloudera/impala/analysis/ModifyStmt.java File fe/src/main/java/com/cloudera/impala/analysis/ModifyStmt.java: Line 87: // On tables with a primary key, ignore key not found errors. > why are those considered errors? in sql, an update/delete with a where clau not sure what was the reasoning behing the original approach, but the point here is that updates that get sent to kudu but can't update anything return an error. this indicates whether the error should be ignored. you might be right that we should reverse the behavior but that is a major refactor that I dont' want to include in a merge patch. I can create a jira though, created IMPALA-3145 Line 99: > add a createTableSink() you mean a createDataSink()? see below (also see comment in InsertStmt) Line 192: for(int i = keyColumnsOffset; i < srcStmt.resultExprs_.size(); ++i) { > formatting Done Line 242: rhsExpr.analyze(analyzer); > move right below l239 Done Line 243: lhsSlotRef.analyze(analyzer); > similar Done Line 251: lhsSlotRef.toSql(), > pack line more tightly Done Line 269: "reference", lhsSlotRef.toSql(), rhsExpr.toSql())); > indentation Done Line 273: throw new AnalysisException(format("Key column '%s' cannot be updated.", > single line, if it fits doesn't Line 279: format("Duplicate value assignment to column: '%s'", > single line Done Line 288: c.getName(), c.getType().toSql(), rhsExpr.toSql(), > single line Done Line 299: > remove blank lines Done http://gerrit.cloudera.org:8080/#/c/1403/6/fe/src/main/java/com/cloudera/impala/analysis/StmtRewriter.java File fe/src/main/java/com/cloudera/impala/analysis/StmtRewriter.java: Line 57: throw new AnalysisException("Update performs internal rewrite only."); > if this is an internal error, it should be a checkState() check comment above Line 57: throw new AnalysisException("Update performs internal rewrite only."); > cryptic error message - what does this mean? also, why did this become an e this is something that I fixed but got lost and was reverted to the original. in the first version of the merge (and now again) update and delete rewrite the subqueries as any other statement http://gerrit.cloudera.org:8080/#/c/1403/6/fe/src/main/java/com/cloudera/impala/analysis/UpdateStmt.java File fe/src/main/java/com/cloudera/impala/analysis/UpdateStmt.java: Line 45: public UpdateStmt(List<String> targetTablePath, > fill up lines Done Line 63: return KuduTableSink.createUpdateSink(table_, referencedColumns_, > single line Done http://gerrit.cloudera.org:8080/#/c/1403/6/fe/src/main/java/com/cloudera/impala/catalog/KuduColumn.java File fe/src/main/java/com/cloudera/impala/catalog/KuduColumn.java: Line 31: this.isKey_ = isKey; > remove this. Done Line 35: /** > remove comments for getters and blank lines between them Done http://gerrit.cloudera.org:8080/#/c/1403/6/fe/src/main/java/com/cloudera/impala/catalog/KuduTable.java File fe/src/main/java/com/cloudera/impala/catalog/KuduTable.java: Line 59: // Alias to the string key that identifies the storage handler for tables. > "for Kudu tables" Done Line 75: // Kudu specific value for the storage handler table property keyed by KEY_STORAGE_HANDLER. > long line Done Line 135: && KUDU_STORAGE_HANDLER.equals(mstbl.getParameters().get(KEY_STORAGE_HANDLER)); > shouldn't just this one be sufficient? Done Line 145: throw new TableLoadingException(String.format("Kudu tables must have at least one key " > long line Done Line 163: if (Sets.intersection(columnNames, keyColumns).size() < keyColumns.size()) { > isn't this sufficient: if (!columnNames.containsAll(keyColumns)) Done Line 186: String keyColumnsProp = Preconditions.checkNotNull(msTbl.getParameters().get(KEY_KEY_COLUMNS) > long line (please look for those and fix all of them) Done Line 187: .toLowerCase(), "'kudu.key_columns' cannot be null."); > never mind. ok Line 187: .toLowerCase(), "'kudu.key_columns' cannot be null."); > we only use Preconditions for internal logic errors, not expected user erro ok http://gerrit.cloudera.org:8080/#/c/1403/6/fe/src/main/java/com/cloudera/impala/catalog/delegates/DdlDelegate.java File fe/src/main/java/com/cloudera/impala/catalog/delegates/DdlDelegate.java: Line 52: public DdlDelegate setDistributeParams(List<TDistributeParam> p) { > how this is abstract? (this only applies to kudu) what do you mean, the class? I didn't implement this but it seems that there is one for kudu and one for everything else and both inherit from this class http://gerrit.cloudera.org:8080/#/c/1403/6/fe/src/main/java/com/cloudera/impala/planner/DataSink.java File fe/src/main/java/com/cloudera/impala/planner/DataSink.java: Line 61: boolean overwrite, boolean ignoreDuplicates_) { > no trailing _ Done Line 76: return KuduTableSink.createInsertSink(table, ignoreDuplicates_); > what about updates? Refactored this do do all operations http://gerrit.cloudera.org:8080/#/c/1403/6/fe/src/main/java/com/cloudera/impala/planner/DistributedPlanner.java File fe/src/main/java/com/cloudera/impala/planner/DistributedPlanner.java: Line 73: analysisResult.isUpdateStmt() || analysisResult.isDeleteStmt()) > stick with existing formatting: Done http://gerrit.cloudera.org:8080/#/c/1403/6/fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java File fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java: Line 68: private final Set<Integer> hostIndexSet_ = Sets.newHashSet(); > explain Done Line 84: // Extract predicates that can evaluated by Kudu. > can be Done Line 114: List<LocatedTablet> tabletLocations = > i'm hoping this is cached in the client. this is, but we create a new client above, so we do this rpc everytime. we need to work on making the client survive across queries Line 154: double baseSelectivity = super.computeSelectivity(); > take a look at what PlanNode.computeSelectivity() does now (we removed the I don't want to tinker with the selectivity computation on the merge patch. particularly because this should be done properly, with a test. created a jira ( IMPALA-3148) and left a TODO Line 174: cardinality_ = Math.max(1, cardinality_); > and also not > numRows Done Line 221: * of the result can be evaluated with Expr::GetValue(NULL)). > doesn't describe side effects on conjuncts param, which happens to be conju done for the most part. the side effect (that the predicates are no longer in conjuncts_) is likely clear from the work "extract" right? Line 273: private static BinaryPredicate transformExclusiveIntLiteralPredicatesToInclusive( > really awkward name; rename (normalizeIntLiteralComparison?) Done Line 284: if (intValue >= Byte.MAX_VALUE) return comparisonPred; > if it's >= max_value, can't you safely substitute this predicate with 'fals left a TODO to do this. http://gerrit.cloudera.org:8080/#/c/1403/6/fe/src/main/java/com/cloudera/impala/planner/KuduTableSink.java File fe/src/main/java/com/cloudera/impala/planner/KuduTableSink.java: Line 33: * data from a plan fragment into an Kudu table using the Kudu client. > "a Kudu" Done Line 42: private ArrayList<Integer> targetColdIdxs_; > why "Cold"? (as opposed to Col) Done Line 55: public static KuduTableSink createInsertSink(Table targetTable, > get rid of these static functions refactored all of this TableSink.create Line 105: public enum Type { > migrate into TableSink and call this Op Done Line 111: public TTableSinkType toThrift() { return TTableSinkType.KUDU_INSERT; } > similar changes here Done http://gerrit.cloudera.org:8080/#/c/1403/6/fe/src/main/java/com/cloudera/impala/service/CatalogOpExecutor.java File fe/src/main/java/com/cloudera/impala/service/CatalogOpExecutor.java: Line 984: handler.dropTable(); > do we have a handler for hbase? we have the NoOpDelegate for everything non-kudu Line 1267: // Forward the opeation to a specific storage backend. If the operation fails, > spelling Done Line 1306: * available for the table, returns a NoOpDelegate instance. > i'd assume that whatever we do for kudu should be mirrored for hbase. agreed, probably not here though. Any specific action item you're recommending? http://gerrit.cloudera.org:8080/#/c/1403/6/fe/src/main/java/com/cloudera/impala/util/KuduUtil.java File fe/src/main/java/com/cloudera/impala/util/KuduUtil.java: Line 53: public static boolean compareSchema(Table hiveTable, KuduTable kuduTable) > msTable instead of hiveTable (apply throughout this function) Done Line 66: if (kuduField == null || fromImpalaType( > change indentation to make this more parseable Done Line 85: public static List<PartialRow> parseSplits(Schema schema, String kuduSplits) > who calls this? KuduDdlDelegate Line 142: private static void setKey(PartialRow key, Type type, JsonArray array, int pos) > comment Done Line 157: private static void setKey(PartialRow key, Type type, TRangeLiteral literal, int pos, > comment Done Line 253: static String ToString(TRangeLiteral l) throws ImpalaRuntimeException { > toString() Done http://gerrit.cloudera.org:8080/#/c/1403/6/fe/src/test/java/com/cloudera/impala/analysis/AnalyzeDDLTest.java File fe/src/test/java/com/cloudera/impala/analysis/AnalyzeDDLTest.java: Line 1617: "range(a) split rows ((1),('abc'),(3)) " + > this is obviously wrong Done http://gerrit.cloudera.org:8080/#/c/1403/6/fe/src/test/java/com/cloudera/impala/analysis/AnalyzeModifyStmtsTest.java File fe/src/test/java/com/cloudera/impala/analysis/AnalyzeModifyStmtsTest.java: Line 72: AnalyzesOk("delete from functional_kudu.testtbl"); > delete ignore is missing Done http://gerrit.cloudera.org:8080/#/c/1403/6/testdata/workloads/functional-planner/queries/PlannerTest/kudu-delete.test File testdata/workloads/functional-planner/queries/PlannerTest/kudu-delete.test: Line 4: | ignoreKeysNotFoundOrDuplicate: false > we don't use camel case here. also, something more succinct would be nice, a problem here is that the keyword means two different things. for insert it means "ignore duplicate keys" and for update/delete means "ignore missing keys". I refactored the explain string to output different things for different ops and to avoid the double negative (this required flipping the flag, which I did for printing purposes only) -- To view, visit http://gerrit.cloudera.org:8080/1403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I239314acbc8434ef673a3a59d2a82a0338ea5876 Gerrit-PatchSet: 6 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Dan Hecht <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Henry Robinson <[email protected]> Gerrit-Reviewer: Marcel Kornacker <[email protected]> Gerrit-Reviewer: Martin Grund <[email protected]> Gerrit-Reviewer: Silvius Rus <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
