Tim Armstrong has posted comments on this change.

Change subject: IMPALA-2736: Basic column-wise slot materialization in Parquet scanner.
......................................................................


Patch Set 3:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/2779/3/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 195: RowBatch batch;
The embedded RowBatch is a little funky, but this seems like the best solution for now.

Line 207: //~ScratchTupleBatch() { pool->FreeAll(); }
?

Line 1729: ScratchTupleBatch
Consider adding a __restrict__ annotation (see the sketch after the comments on this file). I suspect all of the member-variable lookups (num_tuples, tuple_idx, etc.) have to go back through the pointer on every iteration, because the compiler can't prove that scratch_batch doesn't alias a pointer written through inside the loop (-fno-strict-aliasing means it can't infer that the memory is disjoint based on pointer types). Also consider making batch_ an argument and doing the same. I'm not sure whether this is a performance bottleneck, but we do a lot of memory copying in this loop, so reducing extraneous loads and stores will probably help.

Line 1738: scan_node_->
If this loop is performance-sensitive, consider hoisting this pointer indirection out of the loop (the compiler probably can't do it itself with -fno-strict-aliasing).

Line 1742: scanner_conjunct_ctxs_
Not your change, but we should consider just copying the scanner_conjunct_ctxs_ vector into the scan node and saving a pointer indirection. When I previously worked on the scanner it seemed fairly sensitive to cache/TLB effects, probably because each column carries a lot of state: reducing pointer indirections, and therefore the memory footprint of the working set, seemed to help. We could also hoist the empty() lookups into local variables outside the loop (the aliasing problem remains: as far as the compiler is concerned, "this" could have been written to through another pointer).

Line 1823: int num_row_to_commit = TransferScratchTuples(scratch_batch);
There's an assumption here that the remaining scratch rows will fit in the output batch. Can we assert this with a DCHECK?
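Roughly the shape I have in mind for the comments on lines 1729/1738/1742 is below. This is a minimal sketch only: the ScratchTupleBatch layout, the TransferScratchTuples() signature, and the conjunct handling are assumptions for illustration, not the patch's actual code. The point is the __restrict__ qualifiers and the loads hoisted into locals before the loop.

#include <cstdint>
#include <cstring>
#include <vector>

// Stand-in for the patch's scratch batch; the field names follow the review
// comments above, but the layout is hypothetical.
struct ScratchTupleBatch {
  uint8_t* tuple_mem;  // materialized scratch tuples, fixed size, contiguous
  int num_tuples;      // number of materialized tuples
  int tuple_idx;       // next tuple to transfer to the output batch
};

// Sketch of the transfer loop: __restrict__ promises the compiler that the
// scratch batch and the output memory are not aliased by anything written
// inside the loop, and the hot lookups (tuple count, conjunct emptiness,
// scratch memory base) are read once into locals instead of being re-loaded
// through a pointer on every iteration.
int TransferScratchTuples(ScratchTupleBatch* __restrict__ scratch_batch,
                          uint8_t* __restrict__ out_tuple_mem,
                          int out_capacity, int tuple_byte_size,
                          const std::vector<void*>& conjunct_ctxs) {
  const int num_tuples = scratch_batch->num_tuples;             // hoisted load
  const bool has_conjuncts = !conjunct_ctxs.empty();            // hoisted empty()
  const uint8_t* scratch_tuple_mem = scratch_batch->tuple_mem;  // hoisted load
  int tuple_idx = scratch_batch->tuple_idx;
  int num_out = 0;
  while (tuple_idx < num_tuples && num_out < out_capacity) {
    const uint8_t* src = scratch_tuple_mem + tuple_idx * tuple_byte_size;
    ++tuple_idx;
    if (has_conjuncts) {
      // Conjunct evaluation against 'src' would go here; elided in this sketch.
    }
    std::memcpy(out_tuple_mem + num_out * tuple_byte_size, src, tuple_byte_size);
    ++num_out;
  }
  scratch_batch->tuple_idx = tuple_idx;  // write back once, after the loop
  return num_out;
}

int main() {
  const int tuple_byte_size = 8;
  std::vector<uint8_t> scratch_mem(4 * tuple_byte_size, 0);
  std::vector<uint8_t> out_mem(4 * tuple_byte_size, 0);
  ScratchTupleBatch scratch;
  scratch.tuple_mem = scratch_mem.data();
  scratch.num_tuples = 4;
  scratch.tuple_idx = 0;
  std::vector<void*> no_conjuncts;
  int n = TransferScratchTuples(&scratch, out_mem.data(), 4, tuple_byte_size,
                                no_conjuncts);
  return n == 4 ? 0 : 1;
}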
http://gerrit.cloudera.org:8080/#/c/2779/3/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

Line 249: // significantly better than UNLIKELY(literal_count_ == 0 && repeat_count_ == 0)
Hopefully, as this progresses, we might be able to use a batched interface that would have a less branchy inner loop and wouldn't be prone to these problems.

Line 250: if (repeat_count_ == 0) {
Would (repeat_count_ + literal_count_) == 0 be better still? I'm not sure whether this branch still shows up in profiling.
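A tiny self-contained illustration of the line 250 idea (the member names mirror the comment, but the struct is a toy, not the real decoder in rle-encoding.h): the counters are never negative, so their sum is zero exactly when both are zero, which lets one test cover both.

// Toy state illustrating the line 250 suggestion; member names mirror the
// review comment, the surrounding class is hypothetical.
struct ToyRleDecoderState {
  int repeat_count_ = 0;
  int literal_count_ = 0;

  // Current shape: two comparisons in the hot path.
  bool NeedsNextCountsSeparate() const {
    return repeat_count_ == 0 && literal_count_ == 0;
  }

  // Suggested shape: the counters are non-negative, so their sum is zero
  // exactly when both are zero; one add and one comparison.
  bool NeedsNextCountsCombined() const {
    return (repeat_count_ + literal_count_) == 0;
  }
};

int main() {
  ToyRleDecoderState s;
  s.repeat_count_ = 3;
  s.literal_count_ = 0;
  // The two forms agree for valid (non-negative) counter values.
  return s.NeedsNextCountsSeparate() == s.NeedsNextCountsCombined() ? 0 : 1;
}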
--
To view, visit http://gerrit.cloudera.org:8080/2779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I72a613fa805c542e39df20588fb25c57b5f139aa
Gerrit-PatchSet: 3
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Skye Wanderman-Milne <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes