Tim Armstrong has posted comments on this change. Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries ......................................................................
Patch Set 7: (19 comments) Looks promising. We're likely to be building more things on top of this, and the potential impact is pretty big if we made the right design decisions, so I want to make sure that we're making the right choices here - I think there are a few decision points we need to think through careful. http://gerrit.cloudera.org:8080/#/c/6726/7//COMMIT_MSG Commit Message: Line 7: IMPALA-4864 Speed up single slot predicates with dictionaries The JIRA seems to be talking about doing something slightly different: taking predicates of form (x OP constant), translating the constant into the codeword, then comparing dict_index directly to the codeword. That avoids any kind of dictionary or bitmap lookup and can be vectorized so can be very effective. There are various papers about this kind of trick - I can find links if you haven't seen them. I think maybe we need to split out this work into a separate JIRA. Your current approach applies to a much larger range of predicates so is also very useful. It would be good to think about whether the infrastructure will support that kind of optimisation in the future. I think doing a more column-oriented approach might make it more natural to do multiple passes over the column. Line 41: function per file format, so we would have to simulate a file format Yeah that codegen'd function map is a bit of a mess. http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner-ir.cc File be/src/exec/hdfs-parquet-scanner-ir.cc: Line 40 I believe this code was deliberately written like this to avoid the indirection via this->scratch_batch_ in the tight loop below - I think we should leave this alone. Line 77: tuple_index = scratch_batch_->NextValidTuple()) { It seems like this might be slow for the case when many or all of the tuples are valid, since NextSetBit() is fairly expensive compared to incrementing a counter. It's probably worth waiting to see performance numbers, but my intuition is that we'll need to make sure that advancing to the next bit is cheaper. http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 935 I'm not sure, but this hoisting may have been deliberate because the compiler couldn't hoist the load via scratch_batch_ out of the loop Line 208: dict_filters_active_.reset(new bool[scanner_conjunct_ctxs_->size()]); This probably works fine in practice but afaik scoped_ptr will call free instead of free[] on the array. I think unique_ptr<bool[]> does the right thing. Line 865: // Legacy impala files cannot be eliminated here, because the only way to I think we're missing an opportunity to apply the optimisation to columns with a mix of plain and dictionary-encoded pages. My understanding is that the pre-existing dictionary filtering optimisation only works when the whole column is dictionary-encoded, but your new optimisation can be applied on a page-by-page basis. I'm not sure how hard it is to restructure things to get it to work in that case. I think it's probably fairly low-impact because most columns will either be predominantly plain-encoded or predominantly dict-encoded, but we should add a comment so that the limitation is explicit. PS7, Line 1051: _ nit: extraneous _ http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: Line 436: /// XXX Why is this a scoped_ptr ? IIRC so we don't need to include the full header. http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 317: bool ReadValueBatch(ScratchTupleBatch& scratch, int* num_values) { I think it would be best to avoid using the ScratchTupleBatch abstraction in this code - these functions are perf critical and it's easier to reason about performance when the abstraction is stripped away and the code is directly operating over contiguous arrays as much as possible. Maybe just pass in the bitmap directly? Line 408: if (IS_FILTERED && IS_DICT_ENCODED && This patch pushes the predicate evaluation down into ReadFilteredSlot() so that it's done value-by-value as they are materialised. I'm not convinced this is the best approach - I think we should consider peeling predicate evaluation out into a separate loop and doing it columnwise. This would mean that ReadSlot() or some variant would materialize the values and the codewords into an array, then we'd do another pass over the codeword array to evaluate the conjunctions. Or maybe materialize the codewords in one pass, then doing another pass to evaluate conjuncts, then another one to lookup the surviving values in the dictionary. I think there are lots of possible permutations. This would give tighter loops and more scope for future optimisations like SIMD and should also avoid the need to add another template parameter to the column readers. PS7, Line 408: IS_FILTERED I think IS_FILTERED could just be a function template parameter similar to IS_DICT_ENCODED and IN_COLLECTION. Line 417: if (IS_FILTERED) batch.InvalidateTuple(val_index); This and setting the null indicator can be mutually exclusive, right? Since the null indicator doesn't mean anything in that case. PS7, Line 468: !column_has_match_ Maybe rename the member variable to match the function name. I think the current name is slightly confusing because it's not clear what it means if the column has some plain-encoded data. Line 609: bool true_on_null_; Unused? It does seem like we need this for correctness in some cases, right? http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/parquet-column-readers.h File be/src/exec/parquet-column-readers.h: Line 24: #include "exec/parquet-scratch-tuple-batch.h" I think we only need a forward declaration in this header - not a big deal but compile times start getting worse if we're not vigilant. PS7, Line 374: aller must pass a tuple that can be used to : // materialize temporary values from the dictionary. I think this needs updating - there's a callsite that passes in nullptr. Line 391: // Returns true if the row group has no columns which pass filtering conjuncts, The function name and comment aren't too enlightening for me. "Returns true if this column has no values that pass conjunctions. This can be determined in some cases based on the dictionary values." Maybe something like RowGroupFailsConjuncts() or AllValuesFailConjuncts(). http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/parquet-scratch-tuple-batch.h File be/src/exec/parquet-scratch-tuple-batch.h: Line 115: Bitmap valid_bitmap_; We should consider if it's better to use a bitmap or an array of bools. The bitmap has better cache density, but will require more operations to read and write each value. We're only going to have 1024 values typically, so it might be worth wasting cache to save instructions. -- To view, visit http://gerrit.cloudera.org:8080/6726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden <[email protected]> Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
