Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries
......................................................................
Patch Set 7: (8 comments)

http://gerrit.cloudera.org:8080/#/c/6726/7//COMMIT_MSG
Commit Message:

Line 7: IMPALA-4864 Speed up single slot predicates with dictionaries
> There is a bit of a disconnect with the JIRA as specified and the intention
Thanks for the clarification. I agree this is the best path forward for the moment. We should make sure to straighten out the JIRAs, though, so that the JIRA attached to this diff actually reflects the work you're doing, and so that we track the other optimisation as separate work.


http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner-ir.cc
File be/src/exec/hdfs-parquet-scanner-ir.cc:

Line 77: tuple_index = scratch_batch_->NextValidTuple()) {
> Your intuition and mine are similar. I did my best to make the "simple" ve
+1 - no point optimising it until the code around it has settled down.


http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 876: RETURN_IF_ERROR(scalar_reader->InitDictionary(dict_filter_tuple));
This is orthogonal, but we really should evaluate single-slot runtime filters against the dictionary as well. The payoff there is potentially very big - runtime filters can be quite selective and are expensive to evaluate. Do we have a JIRA for that?


http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 317: bool ReadValueBatch(ScratchTupleBatch& scratch, int* num_values) {
> My guess is that inlining the functions from the header should get about as
There's also a design layering question about whether the ColumnReader abstraction should know about ScratchTupleBatch at all. IMO there's a cleaner separation if ColumnReaders only care about materialising a column into a memory buffer and setting bits based on predicates as fast as possible, not about how we're staging tuples in a scratch batch. Even after inlining, it adds an additional pointer indirection via 'scratch'.
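To sketch the separation I have in mind (all names here are hypothetical, for illustration only - this is not Impala's actual ColumnReader API), the reader would fill a flat value buffer and a selection vector, with no knowledge of scratch batches:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: the reader only materialises values into a flat
// buffer and records per-value predicate results in a selection vector.
// How those values later get staged into tuples is someone else's job.
struct ColumnBatch {
  std::vector<int32_t> values;  // materialised column values
  std::vector<bool> selected;   // one entry per value: did the predicate pass?
};

// Materialise 'num_values' values from 'input' and evaluate a single-slot
// predicate (here simply: value > threshold) in the same tight loop.
void ReadAndFilter(const int32_t* input, int num_values, int32_t threshold,
                   ColumnBatch* batch) {
  batch->values.assign(input, input + num_values);
  batch->selected.resize(num_values);
  for (int i = 0; i < num_values; ++i) {
    batch->selected[i] = input[i] > threshold;
  }
}
```

The point of the sketch is that the hot loop touches only plain buffers - no indirection through a scratch-batch object.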
We can try to coax the compiler into doing what we want, but at some point it's more robust, and adds less cognitive load, if we just write code that's closer to what we want the compiler to generate. On a lot of queries we spend up to 50% of CPU time in this function, so the perf matters a lot (I'm not just being anal about shaving CPU cycles). We've had a couple of rounds of attempting to optimise this function and it is *very* sensitive. Even then we left a lot of performance on the floor for subtle reasons: see https://gerrit.cloudera.org/#/c/6950/. The big problem is that we have multiple layers of non-trivial inlined functions inside that loop, and it's hard to reason about what the compiler and CPU will actually be doing.


PS7, Line 408: IS_FILTERED
> The behavior is different as a class, not just at this function level, so I
It looks like the only perf-critical places that reference IS_FILTERED are these two templatised functions, so I'm still skeptical that it makes sense to templatise the whole class - otherwise it could just be a member variable. I don't see a reason to treat it differently from IS_DICT_ENCODED.


Line 410: return false;
I'm going to retract my previous comment about materialising the indices as a separate step. I played around with a prototype of batching the slot materialisation, and the best design looks like making the dictionary decoder RLE-aware so that it does only one dictionary lookup per run. See https://github.com/timarmstrong/incubator-impala/commit/d45d8fae03d2645683f82ece21fce9310f78ac7a#diff-66ee012551a23c734bbe834529dd3b3dR306 (warning - it's pretty rough). I think that would mean pushing the bitmap check down into the dictionary decoder, which is something very different from what I was talking about. That requires rearranging a lot of code in a more fundamental way, so the short-term change to ReadFilteredSlot() doesn't really matter either way.


Line 417: if (IS_FILTERED) batch.InvalidateTuple(val_index);
> True.
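To make the RLE-aware decoder idea from the Line 410 comment concrete, here's a rough standalone sketch (the 'Run' struct and 'DecodeRuns' function are hypothetical names, not the actual decoder API): when the dictionary indices are run-length encoded, a run of N identical indices needs only one dictionary lookup, and one predicate evaluation, for the whole run instead of N.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical RLE run of repeated dictionary indices.
struct Run {
  int32_t dict_index;  // dictionary code repeated across the run
  int length;          // number of repetitions
};

// Expand runs into values with one dictionary lookup per run, and mark which
// output positions pass the predicate - also evaluated once per run.
void DecodeRuns(const std::vector<Run>& runs,
                const std::vector<std::string>& dict,
                bool (*pred)(const std::string&),
                std::vector<std::string>* out_values,
                std::vector<bool>* out_selected) {
  for (const Run& run : runs) {
    const std::string& val = dict[run.dict_index];  // single lookup per run
    bool pass = pred(val);                          // single eval per run
    for (int i = 0; i < run.length; ++i) {
      out_values->push_back(val);
      out_selected->push_back(pass);
    }
  }
}
```

This is also where pushing the bitmap check down would live: the per-run predicate result could be combined with a dictionary-wide bitmap before the expansion loop.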
(Was not so in original code.) Do you have a repro? If it's a correctness bug we should make sure we're on top of it.

I agree we're not strict about validating that the def levels are in the expected range, and that we could accept a technically-invalid Parquet file as valid input. I think it's reasonable that we don't exhaustively validate the file, so long as we don't overrun any buffers etc. as a result. I don't see how that leads to any further consequences, though: Impala assumes that all slots coming out of HDFS scans are nullable, so a column can't be non-NULLable from Impala's point of view. If the Parquet file claims that a column is non-NULLable within that file (max_def_level == 0), then I don't think we will ever materialise a NULL value, so I don't think we do the wrong thing there even if the file is invalid.


Line 609: bool true_on_null_;
> I tried to do some optimization based on this and allowed passing in predic
It does seem quite useful - in practice it's common to have columns with many NULLs (less so in benchmarks). I'm in favour of keeping the diff manageable, though; we should file a JIRA to do this as a follow-on IMO so we don't lose track of it.

--
To view, visit http://gerrit.cloudera.org:8080/6726
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zach Amsden <[email protected]>
Gerrit-HasComments: Yes
