Alex Behm has posted comments on this change. Change subject: IMPALA-1654: Support general predicates in most partition DDL operations. ......................................................................
Patch Set 9: (60 comments) http://gerrit.cloudera.org:8080/#/c/1563/9/be/src/service/query-exec-state.cc File be/src/service/query-exec-state.cc: Line 530: const TDdlExecResponse* ddl_resp = catalog_op_executor_->ddl_exec_response(); Add a new SetResultSet(TDdlExecResponse*) helper function, and use it here as well as in UpdateTableAndColumnStats(). http://gerrit.cloudera.org:8080/#/c/1563/9/common/thrift/Frontend.thrift File common/thrift/Frontend.thrift: Line 206: struct TShowFilesParams { I know we talked about this before but I'm bringing it up again because we've recently discovered new bugs in this area. I'm concerned about the behavioral changes introduced with this move to partition ids instead of KVs. When we run an "invalidate metadata", the table ids and partition ids get reset, so I'm thinking that most of these operations that use partition ids will fail with concurrent metadata reloads (because the partition ids change every time). Unfortunately, in practice, metadata invalidations are run very frequently, so I'm concerned about users suddenly facing problems after upgrading. I asked Bharath for confirmation and he agrees we should go back to using the TPartitionKeyValue to avoid subtle behavioral regressions under concurrency. http://gerrit.cloudera.org:8080/#/c/1563/9/fe/src/main/cup/sql-parser.cup File fe/src/main/cup/sql-parser.cup: Line 1549: // clause in an INSERT statement. Partition clause contains dynamic and static I don't think the partition clause for INSERT is different now. Dynamic partition keys are just a SlotRef expr and static partition keys are <SlotRef> = <Constant> BinaryPredicates, so it is possible to use this clause for INSERT as well (with changes to the analysis logic). Let's leave that for future work because this patch is already substantial enough. Can you add a TODO here for that cleanup? Once this patch goes in we should also file a JIRA to track this extra cleanup. http://gerrit.cloudera.org:8080/#/c/1563/9/fe/src/main/java/com/cloudera/impala/analysis/AlterTableSetLocationStmt.java File fe/src/main/java/com/cloudera/impala/analysis/AlterTableSetLocationStmt.java: Line 52: Preconditions.checkState(getPartitionSpec().getPartitionIds().size() == 1); use isEmpty() Line 71: if (partitions.size() != 1) use !isEmpty() http://gerrit.cloudera.org:8080/#/c/1563/9/fe/src/main/java/com/cloudera/impala/analysis/AlterTableStmt.java File fe/src/main/java/com/cloudera/impala/analysis/AlterTableStmt.java: Line 91: // Register this table so we can evaluate partition predicates. Resolve and analyze this table ref so we can evaluate partition predicates. (and change elsewhere) Line 92: tableRef_ = analyzer.resolveTableRef(tableRef_); Add a Preconditions.checkState(tableRef_ instanceof BaseTableRef); http://gerrit.cloudera.org:8080/#/c/1563/9/fe/src/main/java/com/cloudera/impala/analysis/AnalysisContext.java File fe/src/main/java/com/cloudera/impala/analysis/AnalysisContext.java: Line 435: Preconditions.checkState( I spent a lot of time thinking about the implications of these changes. I think it makes sense to ignore the column-level privilege checks for these statements because we should not allow these operations even if the user has sufficient privileges on all columns (but none on the table). We really want to require table-level privileges and the cleanest way of doing that is to not register the column-level privilege requests in the first place. Otherwise, we'll need a lengthy comment here explaining why we ignore some privilege requests for these statements. Instead, I suggest that you temporarily disable the registration of privilege requests when analyzing the partition exprs in PartitionSpec.analyze(). You can use Analyzer.setEnablePrivChecks() to enable/disable. Then you can remove the changes here. We should also add a test that asserts the correct authorization behavior, i.e., that having all column-level privileges is not sufficient to run a "show files partition", etc. http://gerrit.cloudera.org:8080/#/c/1563/9/fe/src/main/java/com/cloudera/impala/analysis/ComputeStatsStmt.java File fe/src/main/java/com/cloudera/impala/analysis/ComputeStatsStmt.java: Line 278: tableRef_.analyze(analyzer); Add Preconditions check that tableRef is a base table ref. http://gerrit.cloudera.org:8080/#/c/1563/9/fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java File fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java: Line 200: if (partitionSpec_.getPartitions().size() != 1) { use isEmpty() http://gerrit.cloudera.org:8080/#/c/1563/9/fe/src/main/java/com/cloudera/impala/analysis/PartitionSpec.java File fe/src/main/java/com/cloudera/impala/analysis/PartitionSpec.java: Line 51: private TableRef tableRef_; You don't need to set this. Instead you can use: analyzer.getDescriptor(tableName_.toString()); This also makes sure that you are using the correct analyzer. Line 57: // Flag used to tell if it is using the old KV style PartitionSpec because only they Referring to "old" behavior is generally not useful, because somebody reading the comment today may not know the entire history behind this code's evolution. It's best to describe what this flag does. Line 58: // require partition specs in a particular KV form. See PARTITION_EXPR_ERROR." extra " at end Line 85: List partitionIds = Lists.newArrayList(); List<Long> Line 130: if (isKV_) { It seems more appropriate to distinguish between a PartitionSpec and a PartitionSetSpec, instead of having this flag here. The PartitionSpec would be the one used for ADD PARTITION and LOAD DATA, and the PartitionSetSpec would be used for those statements that can work on a set of partitions. What do you think? Line 228: // one partition SlotRef such that conjuncts like "SlotRef <op> Literal" works work Line 229: // while "SlotRef <op> SlotRef" and "Literal <op> Literal" don't. Why these restrictions? We should be able to handle "SlotRef <op> SlotRef" just fine, or any other expr that only contains SlotRefs referencing partition columns. I'm ok with disallowing constant exprs like "Literal <op> Literal" because those don't really make sense may indicate that the user made a mistake. Line 242: List<Expr> conjuncts = Lists.newArrayList(); Move this into a separate function and explain the reason for the transformation. Line 245: if (e instanceof BinaryPredicate) { To avoid deep indentation we typically prefer a style like this: if (!(e instanceof BinaryPredicate) continue; BinaryPredicate bp = ((BinaryPredicate) e); if (!bp.getOp().equals(EQ)) continue; This style choice is certainly arguable, but let's keep it this way for consistency with the rest of the code base. Also, I think you can use Predicate.isEquivalencePredicate() here. Line 247: if (bp.getOp().equals(EQ)) { In order to make this completely bulletproof I think we need to fold the constant expr children. For example, consider an expr like this: partition_col = (NULL + 10) We will not find a NullLiteral or StringLiteral child here (but we will after constant folding). I think you can use Expr.foldConstantChildren() for this purpose. We should also consider general predicates like: f(partition_col) = 10 + 20 i.e., ideally we'd not assume that any child is a SlotRef. The only requirement really is that a partition expr should only reference partition columns. Line 256: // so the old way of NULL partition specification doesn't break. "old way" is not a useful description. Instead you should describe the behavior and the reason for this behavior. The reason is that COL = NULL is allowed for selecting the NULL partition, but a COL = NULL predicate can never be true, so we need to transform such predicates before feeding them into the partition pruner. The important part to be commented on is "why" we do this transformation. Line 276: throw new AnalysisException("Partition expr evaluation failed in the backend."); pass in e as the cause Line 279: if (!conjuncts.isEmpty()) { remove Line 294: * Used only in ALTER TABLE ADD PARTITION. another indication that we should probably have two separate classes http://gerrit.cloudera.org:8080/#/c/1563/9/fe/src/main/java/com/cloudera/impala/catalog/Catalog.java File fe/src/main/java/com/cloudera/impala/catalog/Catalog.java: Line 417: if (partitionIds.isEmpty()) style: always use {} for multi-line ifs http://gerrit.cloudera.org:8080/#/c/1563/9/fe/src/main/java/com/cloudera/impala/catalog/HdfsTable.java File fe/src/main/java/com/cloudera/impala/catalog/HdfsTable.java: Line 639: * Gets hdfs partitions by partition id list. Returns null if no match was found. why not return an empty list? Line 643: for(Long id: partitionIds) { style: space after for Line 651: * Gets the hdfs partition by partition id list. Returns null if no match was found. remove "list" Line 950: if (partitions == null) return; Seems weird to silently accept a null input. This is a private function, let's Preconditions.checkNotNull(partitions); Line 951: for(HdfsPartition partition: partitions) { style: space after for http://gerrit.cloudera.org:8080/#/c/1563/9/fe/src/main/java/com/cloudera/impala/planner/HdfsPartitionPruner.java File fe/src/main/java/com/cloudera/impala/planner/HdfsPartitionPruner.java: Line 91: * If 'allowEmpty' is True, partitions that contain no file may survive during Suggest rephrasing to: If 'allowEmpty' is false empty partitions are not returned. Line 476: public List<SlotId> getSlotIds() { getPartitionSlotIds() http://gerrit.cloudera.org:8080/#/c/1563/9/fe/src/main/java/com/cloudera/impala/service/CatalogOpExecutor.java File fe/src/main/java/com/cloudera/impala/service/CatalogOpExecutor.java: Line 336: TResultRow resultRow = new TResultRow(); Using TResultRowBuilder might save a few lines. Line 337: Reference<Long> updated = new Reference<Long>(0L); numUpdatedPartitions Line 433: resultColVal.setString_val("Updated 1 table."); How about just "Updated table." Saying that we updated a single table may mislead the user to believe that we can update multiple tables at the same time. (same below) Line 453: if (params.getSet_tbl_properties_params().isSetPartition_ids()) { can you reverse this check to make the logic consistent with the one above for SET_FILE_FORMAT? Line 462: Reference<Long> col = new Reference<Long>(0L); numUpdatedCols Line 598: TDdlExecResponse resp, Reference<Long> part, Reference<Long> col) throws ImpalaException { part -> numUpdatedPartitions col -> numUpdatedCols modify function comment to describe these parameters (and elsewhere in this file) Line 665: // Set the results to be reported to the client. This code should be unnecessary now. Remove. Line 1794: String.format("Ignoring empty partition list when dropping partition from " + partitions Line 1816: msClient.getHiveClient().dropPartition( Let's switch to the bulk dropPartitions() API. Be sure to honor CatalogOpExecutor.MAX_PARTITION_UPDATES_PER_RPC There are existing examples of updating several partitions in bulk in this file. Line 1820: dropped += 1; ++dropped; Line 1962: applyAlterPartition(tbl, partition); let's update the partitions in bulk and not one at a time, take a look at bulkAlterPartitions() Line 2029: for(HdfsPartition partition: partitions) { update in bulk, look at bulkAlterPartitions() Line 2240: for (HdfsPartition partition : partitions) { update in bulk, see bulkAlterPartitions() Line 2245: updated += 1; ++updated; http://gerrit.cloudera.org:8080/#/c/1563/9/fe/src/test/java/com/cloudera/impala/analysis/AnalyzeDDLTest.java File fe/src/test/java/com/cloudera/impala/analysis/AnalyzeDDLTest.java: Line 110: "Partition spec in ADD PARTITION/LOAD DATA statement should be of " + leave out the ADD PARTITION/LOAD DATA part Line 111: "this format: <COL> = <LITERAL>, <LITERAL> = <COL> or <COL> is null."); IS NULL (capital) Line 118: AnalysisError("alter table functional.alltypes drop" + nit: let's use consistent spacing/chopping (some inconsistency, look at other tests below) personally, I prefer to leave a space on the preceding line like this: "hello world line 1 " + "continued here in line 2 " + "all done." Line 129: "alter table functional.alltypes add " + extra space here, see comment above regarding splitting up string literals Line 179: AnalysisError("alter table functional.alltypes drop " + already covered in the add/drop loop at the top Line 187: "exactly one partition SlotRef."); SlotRef is an internal term, prefer "column reference" Line 202: "Partition expr does not contain any partition."); This msg isn't quite accurate, it should say something like "No matching partition(s) found." Line 502: "All partition expr conjuncts should contain and only contain " + 'conjunct' is probably confusing to users, can you to something like: Partition specs may only reference partition columns: <expr.toSql() here> http://gerrit.cloudera.org:8080/#/c/1563/9/fe/src/test/java/com/cloudera/impala/analysis/ParserTest.java File fe/src/test/java/com/cloudera/impala/analysis/ParserTest.java: Line 1709 Why not change these negative parser tests tests into positive ones? http://gerrit.cloudera.org:8080/#/c/1563/9/shell/impala_client.py File shell/impala_client.py: Line 444: excluded_query_types = ['use', 'drop'] some ALTER statements now expect a result set and some don't, does this change break those alter statements that do not return a result set? This affects the Impala shell http://gerrit.cloudera.org:8080/#/c/1563/9/testdata/workloads/functional-query/queries/QueryTest/alter-table.test File testdata/workloads/functional-query/queries/QueryTest/alter-table.test: Line 515: alter table t_part drop partition (j=2, s is null) the old test was legitimate and interesting, let's not change it. If you want to test the IS NULL functionality, add a new test. http://gerrit.cloudera.org:8080/#/c/1563/9/testdata/workloads/functional-query/queries/QueryTest/partition-ddl-predicates.test File testdata/workloads/functional-query/queries/QueryTest/partition-ddl-predicates.test: Line 1: ==== It feels like we are missing tests around the COL=NULL and COL='' rewrites into COL IS NULL Line 42: show files in p1 partition (j<2, k="a") what should happen when I write something like show files in p1 partition (j<2 and k="a") http://gerrit.cloudera.org:8080/#/c/1563/9/tests/metadata/test_ddl.py File tests/metadata/test_ddl.py: Line 507: def test_partition_ddl_predicates(self, vector): use the unique_database test fixture here. you'll need to rebase to get it, so let's wait for this until the very end -- To view, visit http://gerrit.cloudera.org:8080/1563 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2c9162fcf9d227b8daf4c2e761d57bab4e26408f Gerrit-PatchSet: 9 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Amos Bird <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Amos Bird <[email protected]> Gerrit-Reviewer: Dimitris Tsirogiannis <[email protected]> Gerrit-Reviewer: Jim Apple <[email protected]> Gerrit-HasComments: Yes
