Alex Behm has posted comments on this change.

Change subject: IMPALA-1654: Support general predicates in most partition DDL 
operations.
......................................................................


Patch Set 9:

(60 comments)

http://gerrit.cloudera.org:8080/#/c/1563/9/be/src/service/query-exec-state.cc
File be/src/service/query-exec-state.cc:

Line 530:   const TDdlExecResponse* ddl_resp = 
catalog_op_executor_->ddl_exec_response();
Add a new SetResultSet(TDdlExecResponse*) helper function, and use it here as 
well as in UpdateTableAndColumnStats().


http://gerrit.cloudera.org:8080/#/c/1563/9/common/thrift/Frontend.thrift
File common/thrift/Frontend.thrift:

Line 206: struct TShowFilesParams {
I know we talked about this before but I'm bringing it up again because we've 
recently discovered new bugs in this area. 

I'm concerned about the behavioral changes introduced with this move to 
partition ids instead of KVs. When we run an "invalidate metadata", the table 
ids and partition ids get reset, so I'm thinking that most of these operations 
that use partition ids will fail with concurrent metadata reloads (because the 
partition ids change every time).

Unfortunately, in practice, metadata invalidations are run very frequently, so 
I'm concerned about users suddenly facing problems after upgrading.

I asked Bharath for confirmation and he agrees we should go back to using the 
TPartitionKeyValue to avoid subtle behavioral regressions under concurrency.


http://gerrit.cloudera.org:8080/#/c/1563/9/fe/src/main/cup/sql-parser.cup
File fe/src/main/cup/sql-parser.cup:

Line 1549: // clause in an INSERT statement. Partition clause contains dynamic 
and static
I don't think the partition clause for INSERT is different now. Dynamic 
partition keys are just a SlotRef expr and static partition keys are <SlotRef> 
= <Constant> BinaryPredicates, so it is possible to use this clause for INSERT 
as well (with changes to the analysis logic).

Let's leave that for future work because this patch is already substantial 
enough.

Can you add a TODO here for that cleanup? Once this patch goes in we should 
also file a JIRA to track this extra cleanup.


http://gerrit.cloudera.org:8080/#/c/1563/9/fe/src/main/java/com/cloudera/impala/analysis/AlterTableSetLocationStmt.java
File 
fe/src/main/java/com/cloudera/impala/analysis/AlterTableSetLocationStmt.java:

Line 52:       
Preconditions.checkState(getPartitionSpec().getPartitionIds().size() == 1);
use isEmpty()


Line 71:         if (partitions.size() != 1)
use !isEmpty()


http://gerrit.cloudera.org:8080/#/c/1563/9/fe/src/main/java/com/cloudera/impala/analysis/AlterTableStmt.java
File fe/src/main/java/com/cloudera/impala/analysis/AlterTableStmt.java:

Line 91:     // Register this table so we can evaluate partition predicates.
Resolve and analyze this table ref so we can evaluate partition predicates.

(and change elsewhere)


Line 92:     tableRef_ = analyzer.resolveTableRef(tableRef_);
Add a Preconditions.checkState(tableRef_ instanceof BaseTableRef);


http://gerrit.cloudera.org:8080/#/c/1563/9/fe/src/main/java/com/cloudera/impala/analysis/AnalysisContext.java
File fe/src/main/java/com/cloudera/impala/analysis/AnalysisContext.java:

Line 435:         Preconditions.checkState(
I spent a lot of time thinking about the implications of these changes. I think 
it makes sense to ignore the column-level privilege checks for these statements 
because we should not allow these operations even if the user has sufficient 
privileges on all columns (but none on the table). We really want to require 
table-level privileges and the cleanest way of doing that is to not register 
the column-level privilege requests in the first place.

Otherwise, we'll need a lengthy comment here explaining why we ignore some 
privilege requests for these statements.

Instead, I suggest that you temporarily disable the registration of privilege 
requests when analyzing the partition exprs in PartitionSpec.analyze(). You can 
use Analyzer.setEnablePrivChecks() to enable/disable.

Then you can remove the changes here.

We should also add a test that asserts the correct authorization behavior, 
i.e., that having all column-level privileges is not sufficient to run a "show 
files partition", etc.


http://gerrit.cloudera.org:8080/#/c/1563/9/fe/src/main/java/com/cloudera/impala/analysis/ComputeStatsStmt.java
File fe/src/main/java/com/cloudera/impala/analysis/ComputeStatsStmt.java:

Line 278:         tableRef_.analyze(analyzer);
Add Preconditions check that tableRef is a base table ref.


http://gerrit.cloudera.org:8080/#/c/1563/9/fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
File fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java:

Line 200:         if (partitionSpec_.getPartitions().size() != 1) {
use isEmpty()


http://gerrit.cloudera.org:8080/#/c/1563/9/fe/src/main/java/com/cloudera/impala/analysis/PartitionSpec.java
File fe/src/main/java/com/cloudera/impala/analysis/PartitionSpec.java:

Line 51:   private TableRef tableRef_;
You don't need to set this. Instead you can use:

analyzer.getDescriptor(tableName_.toString());

This also makes sure that you are using the correct analyzer.


Line 57:   // Flag used to tell if it is using the old KV style PartitionSpec 
because only they
Referring to "old" behavior is generally not useful, because somebody reading 
the comment today may not know the entire history behind this code's evolution. 
It's best to describe what this flag does.


Line 58:   // require partition specs in a particular KV form. See 
PARTITION_EXPR_ERROR."
extra " at end


Line 85:     List partitionIds = Lists.newArrayList();
List<Long>


Line 130:     if (isKV_) {
It seems more appropriate to distinguish between a PartitionSpec and a 
PartitionSetSpec, instead of having this flag here. The PartitionSpec would be 
the one used for ADD PARTITION and LOAD DATA, and the PartitionSetSpec would be 
used for those statements that can work on a set of partitions. What do you 
think?


Line 228:     // one partition SlotRef such that conjuncts like "SlotRef <op> 
Literal" works
work


Line 229:     // while "SlotRef <op> SlotRef" and "Literal <op> Literal" don't.
Why these restrictions? We should be able to handle "SlotRef <op> SlotRef" just 
fine, or any other expr that only contains SlotRefs referencing partition 
columns.

I'm ok with disallowing constant exprs like "Literal <op> Literal" because 
those don't really make sense may indicate that the user made a mistake.


Line 242:     List<Expr> conjuncts = Lists.newArrayList();
Move this into a separate function and explain the reason for the 
transformation.


Line 245:       if (e instanceof BinaryPredicate) {
To avoid deep indentation we typically prefer a style like this:

if (!(e instanceof BinaryPredicate) continue;
BinaryPredicate bp = ((BinaryPredicate) e);
if (!bp.getOp().equals(EQ)) continue;

This style choice is certainly arguable, but let's keep it this way for 
consistency with the rest of the code base.

Also, I think you can use Predicate.isEquivalencePredicate() here.


Line 247:         if (bp.getOp().equals(EQ)) {
In order to make this completely bulletproof I think we need to fold the 
constant expr children. For example, consider an expr like this:

partition_col = (NULL + 10)

We will not find a NullLiteral or StringLiteral child here (but we will after 
constant folding).

I think you can use Expr.foldConstantChildren() for this purpose.

We should also consider general predicates like:

f(partition_col) = 10 + 20

i.e., ideally we'd not assume that any child is a SlotRef.

The only requirement really is that a partition expr should only reference 
partition columns.


Line 256:             // so the old way of NULL partition specification doesn't 
break.
"old way" is not a useful description. Instead you should describe the behavior 
and the reason for this behavior. The reason is that COL = NULL is allowed for 
selecting the NULL partition, but a COL = NULL predicate can never be true, so 
we need to transform such predicates before feeding them into the partition 
pruner.

The important part to be commented on is "why" we do this transformation.


Line 276:       throw new AnalysisException("Partition expr evaluation failed 
in the backend.");
pass in e as the cause


Line 279:     if (!conjuncts.isEmpty()) {
remove


Line 294:    * Used only in ALTER TABLE ADD PARTITION.
another indication that we should probably have two separate classes


http://gerrit.cloudera.org:8080/#/c/1563/9/fe/src/main/java/com/cloudera/impala/catalog/Catalog.java
File fe/src/main/java/com/cloudera/impala/catalog/Catalog.java:

Line 417:     if (partitionIds.isEmpty())
style: always use {} for multi-line ifs


http://gerrit.cloudera.org:8080/#/c/1563/9/fe/src/main/java/com/cloudera/impala/catalog/HdfsTable.java
File fe/src/main/java/com/cloudera/impala/catalog/HdfsTable.java:

Line 639:    * Gets hdfs partitions by partition id list. Returns null if no 
match was found.
why not return an empty list?


Line 643:     for(Long id: partitionIds) {
style: space after for


Line 651:    * Gets the hdfs partition by partition id list. Returns null if no 
match was found.
remove "list"


Line 950:     if (partitions == null) return;
Seems weird to silently accept a null input. This is a private function, let's 
Preconditions.checkNotNull(partitions);


Line 951:     for(HdfsPartition partition: partitions) {
style: space after for


http://gerrit.cloudera.org:8080/#/c/1563/9/fe/src/main/java/com/cloudera/impala/planner/HdfsPartitionPruner.java
File fe/src/main/java/com/cloudera/impala/planner/HdfsPartitionPruner.java:

Line 91:    * If 'allowEmpty' is True, partitions that contain no file may 
survive during
Suggest rephrasing to:

If 'allowEmpty' is false empty partitions are not returned.


Line 476:   public List<SlotId> getSlotIds() {
getPartitionSlotIds()


http://gerrit.cloudera.org:8080/#/c/1563/9/fe/src/main/java/com/cloudera/impala/service/CatalogOpExecutor.java
File fe/src/main/java/com/cloudera/impala/service/CatalogOpExecutor.java:

Line 336:     TResultRow resultRow = new TResultRow();
Using TResultRowBuilder might save a few lines.


Line 337:     Reference<Long> updated = new Reference<Long>(0L);
numUpdatedPartitions


Line 433:             resultColVal.setString_val("Updated 1 table.");
How about just "Updated table."

Saying that we updated a single table may mislead the user to believe that we 
can update multiple tables at the same time.

(same below)


Line 453:           if 
(params.getSet_tbl_properties_params().isSetPartition_ids()) {
can you reverse this check to make the logic consistent with the one above for 
SET_FILE_FORMAT?


Line 462:           Reference<Long> col = new Reference<Long>(0L);
numUpdatedCols


Line 598:       TDdlExecResponse resp, Reference<Long> part, Reference<Long> 
col) throws ImpalaException {
part -> numUpdatedPartitions
col -> numUpdatedCols

modify function comment to describe these parameters

(and elsewhere in this file)


Line 665:     // Set the results to be reported to the client.
This code should be unnecessary now. Remove.


Line 1794:             String.format("Ignoring empty partition list when 
dropping partition from " +
partitions


Line 1816:           msClient.getHiveClient().dropPartition(
Let's switch to the bulk dropPartitions() API. Be sure to honor 
CatalogOpExecutor.MAX_PARTITION_UPDATES_PER_RPC

There are existing examples of updating several partitions in bulk in this file.


Line 1820:           dropped += 1;
++dropped;


Line 1962:           applyAlterPartition(tbl, partition);
let's update the partitions in bulk and not one at a time, take a look at 
bulkAlterPartitions()


Line 2029:       for(HdfsPartition partition: partitions) {
update in bulk, look at bulkAlterPartitions()


Line 2240:       for (HdfsPartition partition : partitions) {
update in bulk, see bulkAlterPartitions()


Line 2245:             updated += 1;
++updated;


http://gerrit.cloudera.org:8080/#/c/1563/9/fe/src/test/java/com/cloudera/impala/analysis/AnalyzeDDLTest.java
File fe/src/test/java/com/cloudera/impala/analysis/AnalyzeDDLTest.java:

Line 110:                   "Partition spec in ADD PARTITION/LOAD DATA 
statement should be of " +
leave out the ADD PARTITION/LOAD DATA part


Line 111:                   "this format: <COL> = <LITERAL>, <LITERAL> = <COL> 
or <COL> is null.");
IS NULL (capital)


Line 118:     AnalysisError("alter table functional.alltypes drop" +
nit: let's use consistent spacing/chopping (some inconsistency, look at other 
tests below)

personally, I prefer to leave a space on the preceding line like this:

"hello world line 1 " +
"continued here in line 2 " +
"all done."


Line 129:         "alter table functional.alltypes add " +
extra space here, see comment above regarding splitting up string literals


Line 179:     AnalysisError("alter table functional.alltypes drop " +
already covered in the add/drop loop at the top


Line 187:         "exactly one partition SlotRef.");
SlotRef is an internal term, prefer "column reference"


Line 202:         "Partition expr does not contain any partition.");
This msg isn't quite accurate, it should say something like "No matching 
partition(s) found."


Line 502:                   "All partition expr conjuncts should contain and 
only contain " +
'conjunct' is probably confusing to users, can you to something like:

Partition specs may only reference partition columns: <expr.toSql() here>


http://gerrit.cloudera.org:8080/#/c/1563/9/fe/src/test/java/com/cloudera/impala/analysis/ParserTest.java
File fe/src/test/java/com/cloudera/impala/analysis/ParserTest.java:

Line 1709
Why not change these negative parser tests tests into positive ones?


http://gerrit.cloudera.org:8080/#/c/1563/9/shell/impala_client.py
File shell/impala_client.py:

Line 444:     excluded_query_types = ['use', 'drop']
some ALTER statements now expect a result set and some don't, does this change 
break those alter statements that do not return a result set? This affects the 
Impala shell


http://gerrit.cloudera.org:8080/#/c/1563/9/testdata/workloads/functional-query/queries/QueryTest/alter-table.test
File testdata/workloads/functional-query/queries/QueryTest/alter-table.test:

Line 515: alter table t_part drop partition (j=2, s is null)
the old test was legitimate and interesting, let's not change it. If you want 
to test the IS NULL functionality, add a new test.


http://gerrit.cloudera.org:8080/#/c/1563/9/testdata/workloads/functional-query/queries/QueryTest/partition-ddl-predicates.test
File 
testdata/workloads/functional-query/queries/QueryTest/partition-ddl-predicates.test:

Line 1: ====
It feels like we are missing tests around the COL=NULL and COL='' rewrites into 
COL IS NULL


Line 42: show files in p1 partition (j<2, k="a")
what should happen when I write something like

show files in p1 partition (j<2 and k="a")


http://gerrit.cloudera.org:8080/#/c/1563/9/tests/metadata/test_ddl.py
File tests/metadata/test_ddl.py:

Line 507:   def test_partition_ddl_predicates(self, vector):
use the unique_database test fixture here. you'll need to rebase to get it, so 
let's wait for this until the very end


-- 
To view, visit http://gerrit.cloudera.org:8080/1563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2c9162fcf9d227b8daf4c2e761d57bab4e26408f
Gerrit-PatchSet: 9
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Amos Bird <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Amos Bird <[email protected]>
Gerrit-Reviewer: Dimitris Tsirogiannis <[email protected]>
Gerrit-Reviewer: Jim Apple <[email protected]>
Gerrit-HasComments: Yes

Reply via email to