Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries
......................................................................


Patch Set 7:

(19 comments)

Looks promising. We're likely to build more things on top of this, and the 
potential impact is pretty big if we make the right design decisions, so I 
want to make sure we get the design right - I think there are a few decision 
points we need to think through carefully.

http://gerrit.cloudera.org:8080/#/c/6726/7//COMMIT_MSG
Commit Message:

Line 7: IMPALA-4864 Speed up single slot predicates with dictionaries
The JIRA seems to be talking about doing something slightly different: taking 
predicates of the form (x OP constant), translating the constant into its 
codeword, then comparing dict_index directly to the codeword. That avoids any 
kind of dictionary or bitmap lookup and can be vectorized, so it can be very 
effective. There are various papers about this kind of trick - I can find links 
if you haven't seen them. I think maybe we need to split this work out into a 
separate JIRA.
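
Roughly the idea, for an equality predicate (all names here are made up to 
illustrate it, not the real Impala APIs):

  #include <cstdint>
  #include <vector>

  // Translate the constant into codeword space once per page / row group.
  // Returns -1 if the constant is not in the dictionary (no row can match).
  int32_t FindCodeword(const std::vector<int64_t>& dict, int64_t constant) {
    for (int32_t i = 0; i < static_cast<int32_t>(dict.size()); ++i) {
      if (dict[i] == constant) return i;
    }
    return -1;
  }

  // Evaluate the predicate directly against the per-value dictionary indexes:
  // no dictionary or bitmap lookup per row, and the loop is trivially
  // vectorizable.
  void EvalEquals(const int32_t* dict_indexes, int num_values,
      int32_t constant_codeword, uint8_t* selected) {
    for (int i = 0; i < num_values; ++i) {
      selected[i] = (dict_indexes[i] == constant_codeword);
    }
  }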

Your current approach applies to a much larger range of predicates, so it is 
also very useful. It would be good to think about whether the infrastructure 
will support that kind of optimisation in the future. I think a more 
column-oriented approach might make it more natural to do multiple passes over 
the column.


Line 41: function per file format, so we would have to simulate a file format
Yeah that codegen'd function map is a bit of a mess.


http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner-ir.cc
File be/src/exec/hdfs-parquet-scanner-ir.cc:

Line 40
I believe this code was deliberately written like this to avoid the indirection 
via this->scratch_batch_ in the tight loop below - I think we should leave this 
alone.


Line 77:        tuple_index = scratch_batch_->NextValidTuple()) {
It seems like this might be slow for the case when many or all of the tuples 
are valid, since NextSetBit() is fairly expensive compared to incrementing a 
counter. It's probably worth waiting to see performance numbers, but my 
intuition is that we'll need a cheaper way to advance to the next valid tuple 
when most or all of the tuples are valid.
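
Something like the following shape is what I have in mind (illustrative only - 
ValiditySketch stands in for ScratchTupleBatch, these aren't the real names):

  #include <cstdint>
  #include <vector>

  struct ValiditySketch {
    std::vector<bool> valid;

    // In real code this would be a tracked count rather than a rescan.
    bool AllValid() const {
      for (bool v : valid) if (!v) return false;
      return true;
    }
    // Returns the next valid index >= from, or -1 when exhausted.
    int NextValid(int from) const {
      for (int i = from; i < static_cast<int>(valid.size()); ++i) {
        if (valid[i]) return i;
      }
      return -1;
    }
  };

  template <typename Fn>
  void ForEachValidTuple(const ValiditySketch& batch, Fn process) {
    if (batch.AllValid()) {
      // Dense fast path: a plain counter increment per tuple, no bit scanning.
      for (int i = 0; i < static_cast<int>(batch.valid.size()); ++i) process(i);
    } else {
      // Sparse path: only pay the bit-scan cost when it actually skips tuples.
      for (int i = batch.NextValid(0); i != -1; i = batch.NextValid(i + 1)) {
        process(i);
      }
    }
  }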


http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 935
I'm not sure, but this hoisting may have been deliberate because the compiler 
couldn't hoist the load via scratch_batch_ out of the loop on its own.


Line 208:   dict_filters_active_.reset(new bool[scanner_conjunct_ctxs_->size()]);
This probably works fine in practice, but afaik scoped_ptr will call delete 
instead of delete[] on the array. I think unique_ptr<bool[]> does the right thing.
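
i.e. something like this (sketch only, with a stand-in for the surrounding 
scanner class):

  #include <memory>
  #include <vector>

  struct ScannerSketch {
    std::vector<int> scanner_conjunct_ctxs_;  // stand-in for the real member
    // unique_ptr<T[]> calls delete[] on reset/destruction, matching the array
    // new[] below; scoped_ptr<bool> would call plain delete.
    std::unique_ptr<bool[]> dict_filters_active_;

    void InitDictFilters() {
      dict_filters_active_.reset(new bool[scanner_conjunct_ctxs_.size()]());
    }
  };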


Line 865:     // Legacy impala files cannot be eliminated here, because the only way to
I think we're missing an opportunity to apply the optimisation to columns with 
a mix of plain and dictionary-encoded pages.

My understanding is that the pre-existing dictionary filtering optimisation 
only works when the whole column is dictionary-encoded, but your new 
optimisation can be applied on a page-by-page basis.

I'm not sure how hard it is to restructure things to get it to work in that 
case. I think it's probably fairly low-impact because most columns will either 
be predominantly plain-encoded or predominantly dict-encoded, but we should add 
a comment so that the limitation is explicit.


PS7, Line 1051: _
nit: extraneous _


http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 436:   /// XXX Why is this a scoped_ptr ?
IIRC so we don't need to include the full header.


http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 317:   bool ReadValueBatch(ScratchTupleBatch& scratch, int* num_values) {
I think it would be best to avoid using the ScratchTupleBatch abstraction in 
this code - these functions are perf-critical, and it's easier to reason about 
performance when the abstraction is stripped away and the code operates 
directly over contiguous arrays as much as possible.

Maybe just pass in the bitmap directly?


Line 408:           if (IS_FILTERED && IS_DICT_ENCODED &&
This patch pushes the predicate evaluation down into ReadFilteredSlot() so that 
it's done value-by-value as the values are materialised.

I'm not convinced this is the best approach - I think we should consider 
peeling predicate evaluation out into a separate loop and doing it columnwise. 
This would mean that ReadSlot() or some variant would materialize the values 
and the codewords into an array, and then we'd do another pass over the 
codeword array to evaluate the conjuncts. Or maybe materialize the codewords in 
one pass, then do another pass to evaluate the conjuncts, then another to look 
up the surviving values in the dictionary. There are lots of possible 
permutations.

This would give tighter loops and more scope for future optimisations like SIMD 
and should also avoid the need to add another template parameter to the column 
readers.
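
A rough sketch of the structure I mean (names are illustrative, not the real 
column reader interfaces):

  #include <cstdint>
  #include <vector>

  // Once per dictionary: evaluate the conjunct against each distinct value,
  // giving a codeword -> pass/fail table (dictionaries are small). Then one
  // tight pass over the codeword array, and a final pass that looks up only
  // the surviving values.
  int FilterBatchColumnwise(const std::vector<int64_t>& dict,
      bool (*conjunct)(int64_t), const int32_t* codewords, int num_values,
      int64_t* out_values) {
    std::vector<uint8_t> codeword_passes(dict.size());
    for (int cw = 0; cw < static_cast<int>(dict.size()); ++cw) {
      codeword_passes[cw] = conjunct(dict[cw]);
    }
    std::vector<uint8_t> selected(num_values);
    for (int i = 0; i < num_values; ++i) {
      // Tight loop with no data-dependent branches; easy to vectorize.
      selected[i] = codeword_passes[codewords[i]];
    }
    int num_out = 0;
    for (int i = 0; i < num_values; ++i) {
      if (selected[i]) out_values[num_out++] = dict[codewords[i]];
    }
    return num_out;
  }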


PS7, Line 408: IS_FILTERED
I think IS_FILTERED could just be a function template parameter similar to 
IS_DICT_ENCODED and IN_COLLECTION.
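
i.e. roughly this (sketch; the value/slot types are placeholders):

  #include <cstdint>

  // IS_FILTERED resolved at compile time on the function, like IN_COLLECTION /
  // IS_DICT_ENCODED, rather than another class template parameter.
  template <bool IS_DICT_ENCODED, bool IS_FILTERED>
  bool ReadSlotSketch(int64_t value, bool passes_filter, int64_t* slot) {
    // When IS_FILTERED is false this branch compiles away entirely.
    if (IS_FILTERED && !passes_filter) return false;
    *slot = value;
    return true;
  }

  // Callers instantiate the variant they need, e.g. ReadSlotSketch<true, true>(...).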


Line 417:           if (IS_FILTERED) batch.InvalidateTuple(val_index);
This and setting the null indicator can be mutually exclusive, right? The null 
indicator doesn't mean anything in that case.

PS7, Line 468:  !column_has_match_
Maybe rename the member variable to match the function name. I think the 
current name is slightly confusing because it's not clear what it means if the 
column has some plain-encoded data.


Line 609:   bool true_on_null_;
Unused? It does seem like we need this for correctness in some cases, right?


http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

Line 24: #include "exec/parquet-scratch-tuple-batch.h"
I think we only need a forward declaration in this header - not a big deal but 
compile times start getting worse if we're not vigilant.
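
i.e. if the header only uses ScratchTupleBatch by pointer or reference, 
something like this should be enough (assuming it's declared as a struct - use 
class otherwise):

  struct ScratchTupleBatch;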


PS7, Line 374: aller must pass a tuple that can be used to
             :   // materialize temporary values from the dictionary.
I think this needs updating - there's a callsite that passes in nullptr.


Line 391:   // Returns true if the row group has no columns which pass filtering conjuncts,
The function name and comment aren't too enlightening for me.

"Returns true if this column has no values that pass conjunctions. This can be 
determined in some cases based on the dictionary values."

Maybe something like RowGroupFailsConjuncts() or AllValuesFailConjuncts().


http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/parquet-scratch-tuple-batch.h
File be/src/exec/parquet-scratch-tuple-batch.h:

Line 115:   Bitmap valid_bitmap_;
We should consider whether it's better to use a bitmap or an array of bools. 
The bitmap has better cache density, but requires more operations to read and 
write each value. We're typically only going to have 1024 values, so it might 
be worth wasting cache to save instructions.
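
Back-of-the-envelope for a 1024-tuple batch: a packed bitmap is 128 bytes (two 
cache lines) but each test/update needs a shift, a mask and a read-modify-write; 
a byte per tuple is 1024 bytes (sixteen cache lines) but each test/update is a 
single load or store, e.g. (stand-in type, not the real Bitmap class):

  #include <cstdint>
  #include <vector>

  struct ValidBytes {
    std::vector<uint8_t> valid;                          // one byte per tuple
    bool IsValid(int i) const { return valid[i] != 0; }  // plain load
    void Invalidate(int i) { valid[i] = 0; }             // plain store
  };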


-- 
To view, visit http://gerrit.cloudera.org:8080/6726
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes
