[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries

Zach Amsden (Code Review) Mon, 22 May 2017 13:34:01 -0700

Zach Amsden has posted comments on this change.

Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries
......................................................................

Patch Set 7:

(17 comments)

Thanks for looking at this!

http://gerrit.cloudera.org:8080/#/c/6726/7//COMMIT_MSG
Commit Message:

Line 7: IMPALA-4864 Speed up single slot predicates with dictionaries
> The JIRA seems to be talking about doing something slightly different: taki
There is a bit of a disconnect with the JIRA as specified and the intention
behind the JIRA, which had me initially working down the wrong path. I
realized that (x OP k) type predicates can only be evaluated for OP = or <=>,
and started working on a way to identify such predicates in sub-expressions of
the conjuncts. It all seemed quite speculative and limited in applicability.
Once can imagine expanding this to ordering predicates with sorted
dictionaries, but as we don't have that yet, it would be building without a
foundation.

After clarification from Marcel, it became apparent that what was intended was
to be able to evaluate arbitrary single slot predicates against the dictionary
(as we do this anyway for row-group filtering) and save those results to
eliminate the cost of re-evaluating the same conjuncts again later.

The central difficulty of this problem is that the set of conjuncts which can
be dictionary evaluated is not known until row group time.

This is the best proposal that I have come up with so far at attempting to
solve that problem.

I agree we should get the design right on this one. Doing this wrong could be
costly in terms of either performance or complexity.

http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner-ir.cc
File be/src/exec/hdfs-parquet-scanner-ir.cc:

Line 40
> I believe this code was deliberately written like this to avoid the indirec
I have no problem reverting this. Originally I had the two functions merged
into one, but I didn't like imposing the overhead of checking the bitmap on
this version. While the code evolved, I had to put an interface on
ScratchBatch just for sanity, as too many pieces of code were playing with its
internals.

I think the interface is actually broad enough now to go back to the original
code (minus the interference with internals). I'll simply do this at the end:

scratch_batch_->Advance((scratch_tuple - scratch_tuple_start) / tuple_size);

Line 77: tuple_index = scratch_batch_->NextValidTuple()) {
> It seems like this might be slow for the case when many or all of the tuple
Your intuition and mine are similar. I did my best to make the "simple"
version of NextSetBit() as fast as possible, but I believe intrinsics might
actually be warranted here.

Wanted to get the core ideas sound before going in that direction though.

Oh, and for reference, I also tried various other bitmap implementations -
boost::dynamic_bitset and std::vector<bool>. I didn't like the boost
dependency (or the code too much) and the std::vector<bool> (specialized
bit-vector) implementation compiled to assembler sent me running for the hills.

Branches, branches everywhere...

http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 935
> I'm not sure, but this hoisting may have been deliberate because the compil
It certainly shouldn't hurt to re-hoist

Line 208: dict_filters_active_.reset(new
bool[scanner_conjunct_ctxs_->size()]);
> This probably works fine in practice but afaik scoped_ptr will call free in
Will check

PS7, Line 1051: _
> nit: extraneous _
Originally a class member, until I realized this function had no access. Will
fix.

http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 317: bool ReadValueBatch(ScratchTupleBatch& scratch, int* num_values) {
> I think it would be best to avoid using the ScratchTupleBatch abstraction i
My guess is that inlining the functions from the header should get about as
optimal as possible.

Maybe hoisting 'max = scratch.capacity()' to tell the compiler it is unchanged
throughout the loop, but other than that, I don't see much room for improvement
that has been lost. Let's see what perf results look like when that is ready.

I don't want to pass the bitmap directly, as I only want the bitmap to be
touched (or even known about) by those pieces of code which actually are
running in a filtered column reader.

Line 408: if (IS_FILTERED && IS_DICT_ENCODED &&
> This patch pushes the predicate evaluation down into ReadFilteredSlot() so
Perhaps there is a way to do this, but I don't see a clear path to that. We
have some choices, but they are all problematic:

Materialize codewords and values. Where do the codewords go? There is no
space in the tuple.

Materialize codewords only. What do we do when there is no dictionary?

Materialize codewords and values conditionally. Now we have to write code to
deal with either of the two - and to go back inside the dictionary and convert
codewords back into values, while the entire functionality to do that was
hidden inside the private details of the dict<T> decoder and not exposed to the
scanner at all.

In any case, doing predicate evaluation columnwise is certainly interesting
(SIMD combined with a lazy value lookup cache seems especially promising), but
it completely wreaks havoc with expression evaluation order, and I think
dealing with the ramifications of that is probably beyond the scope of this
diff.

Here, we are just trying to make use of our pre-computed dictionary filter
results as efficiently as possible.

PS7, Line 408: IS_FILTERED
> I think IS_FILTERED could just be a function template parameter similar to
The behavior is different as a class, not just at this function level, so I
don't see how that would work.

Line 417: if (IS_FILTERED) batch.InvalidateTuple(val_index);
> This and setting the null indicator can be mutually exclusive, right? Sin
True. (Was not so in original code).

There's actually a bug here in our decoding, but I don't want to open that
right now. We don't throw any kind of error on getting a NULL decoding for a
non-NULLable column.

PS7, Line 468: !column_has_match_
> Maybe rename the member variable to match the function name. I think the cu
Done

Line 609: bool true_on_null_;
> Unused? It does seem like we need this for correctness in some cases, right
I tried to do some optimization based on this and allowed passing in predicates
that were true on null in the last diff, but it looked very speculative and
ended up being a lot of work and complexity. I'll get rid of this.

http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

Line 24: #include "exec/parquet-scratch-tuple-batch.h"
> I think we only need a forward declaration in this header - not a big deal
Will do the bare minimum I can get away with.

Unfortunately very familiar with the compile time problem.

PS7, Line 374: aller must pass a tuple that can be used to
: // materialize temporary values from the dictionary.
> I think this needs updating - there's a callsite that passes in nullptr.
Done

Line 391: // Returns true if the row group has no columns which pass
filtering conjuncts,
> The function name and comment aren't too enlightening for me.
Done

http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/parquet-scratch-tuple-batch.h
File be/src/exec/parquet-scratch-tuple-batch.h:

Line 115: Bitmap valid_bitmap_;
> We should consider if it's better to use a bitmap or an array of bools. The
I think our bitmap implementation is quite good - if we manage to get this
compiled down to a `testb M, bit`, we should be in good shape.

Again lets wait for benchmarks to tell us the answer.

http://gerrit.cloudera.org:8080/#/c/6726/7/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
File
testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test:

Line 25: parquet dictionary predicates: int_col IS NULL, int_col > 1
Oops, this needs reverting since I removed support for true on NULL predicates.

--
To view, visit http://gerrit.cloudera.org:8080/6726
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zach Amsden <[email protected]>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries

Reply via email to