Internal Jenkins has submitted this change and it was merged. Change subject: IMPALA-3567: Part 1: groundwork to make Join build sides DataSinks ......................................................................
IMPALA-3567: Part 1: groundwork to make Join build sides DataSinks Refactor DataSink interface to be more generic. We need more flexibility in setting up MemTrackers, so that memory is accounted against the right ExecNode. Also removes some redundancy between DataSink subclasses in setting up RuntimeProfiles, etc. Remove the redundancy in the DataSink between passing eos to GetNext() and FlushFinal(). This simplifies HdfsTableSink quite a bit and makes handling empty batches simpler. Partially refactor join nodes so that the control flow between BlockingJoinNode::Open() and its subclasses is easier to follow. BlockingJoinNode now only calls one virtual function on its subclasses: ConstructJoinBuild(). Once we convert all join nodes to use the DataSink interface, we will also be able to remove that as well. As a minor optimisation, avoid updating a timer that is ignored for non-async builds. As a proof of concept, this patch separates out the build side of NestedLoopJoinNode into a class that implements the DataSink interface. The main challenge here is that NestedLoopJoinNode recycles row batches to avoid reallocations and copies of row batches in subplans. The solution to this is: - Retain the special-case optimisation for SingularRowSrc - Use the row batch cache and RowBatch::AcquireState() to copy the state of row batches passed to Send(), while recycling the RowBatch objects. Refactoring the partitioned hash join is left for Part 2. Testing: Ran exhaustive, core ASAN, and exhaustive non-partioned agg/join builds. Also ran a local stress test. Performance: Ran TPC-H nested locally. The results show that the change is perf-neutral. +------------------+-----------------------+---------+------------+------------+----------------+ | Workload | File Format | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) | +------------------+-----------------------+---------+------------+------------+----------------+ | TPCH_NESTED(_20) | parquet / none / none | 7.75 | +0.19% | 5.64 | -0.34% | +------------------+-----------------------+---------+------------+------------+----------------+ +------------------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+ | Workload | Query | File Format | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Num Clients | Iters | +------------------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+ | TPCH_NESTED(_20) | TPCH-Q17 | parquet / none / none | 18.96 | 17.95 | +5.61% | 4.97% | 0.71% | 1 | 10 | | TPCH_NESTED(_20) | TPCH-Q14 | parquet / none / none | 3.61 | 3.56 | +1.25% | 0.97% | 1.19% | 1 | 10 | | TPCH_NESTED(_20) | TPCH-Q8 | parquet / none / none | 6.25 | 6.23 | +0.44% | 0.44% | 0.90% | 1 | 10 | | TPCH_NESTED(_20) | TPCH-Q10 | parquet / none / none | 5.84 | 5.83 | +0.30% | 1.21% | 1.82% | 1 | 10 | | TPCH_NESTED(_20) | TPCH-Q5 | parquet / none / none | 4.91 | 4.90 | +0.11% | 0.18% | 0.78% | 1 | 10 | | TPCH_NESTED(_20) | TPCH-Q21 | parquet / none / none | 17.82 | 17.81 | +0.07% | 0.66% | 0.58% | 1 | 10 | | TPCH_NESTED(_20) | TPCH-Q4 | parquet / none / none | 5.12 | 5.12 | -0.02% | 0.97% | 1.23% | 1 | 10 | | TPCH_NESTED(_20) | TPCH-Q9 | parquet / none / none | 23.85 | 23.88 | -0.15% | 0.72% | 0.22% | 1 | 10 | | TPCH_NESTED(_20) | TPCH-Q12 | parquet / none / none | 6.15 | 6.16 | -0.16% | 1.60% | 1.72% | 1 | 10 | | TPCH_NESTED(_20) | TPCH-Q3 | parquet / none / none | 5.46 | 5.47 | -0.23% | 1.28% | 0.90% | 1 | 10 | | TPCH_NESTED(_20) | TPCH-Q16 | parquet / none / none | 3.61 | 3.62 | -0.26% | 1.00% | 1.36% | 1 | 10 | | TPCH_NESTED(_20) | TPCH-Q19 | parquet / none / none | 20.19 | 20.31 | -0.58% | 1.67% | 0.65% | 1 | 10 | | TPCH_NESTED(_20) | TPCH-Q7 | parquet / none / none | 9.42 | 9.48 | -0.68% | 0.87% | 0.71% | 1 | 10 | | TPCH_NESTED(_20) | TPCH-Q18 | parquet / none / none | 12.94 | 13.06 | -0.90% | 0.59% | 0.48% | 1 | 10 | | TPCH_NESTED(_20) | TPCH-Q22 | parquet / none / none | 1.09 | 1.10 | -0.92% | 2.26% | 2.22% | 1 | 10 | | TPCH_NESTED(_20) | TPCH-Q13 | parquet / none / none | 3.75 | 3.78 | -0.94% | 2.04% | 2.86% | 1 | 10 | | TPCH_NESTED(_20) | TPCH-Q20 | parquet / none / none | 4.33 | 4.37 | -1.10% | 3.00% | 2.43% | 1 | 10 | | TPCH_NESTED(_20) | TPCH-Q2 | parquet / none / none | 2.39 | 2.42 | -1.38% | 1.54% | 1.30% | 1 | 10 | | TPCH_NESTED(_20) | TPCH-Q11 | parquet / none / none | 1.43 | 1.46 | -1.78% | 2.05% | 2.77% | 1 | 10 | | TPCH_NESTED(_20) | TPCH-Q6 | parquet / none / none | 2.29 | 2.33 | -1.79% | 0.56% | 1.23% | 1 | 10 | | TPCH_NESTED(_20) | TPCH-Q15 | parquet / none / none | 5.04 | 5.13 | -1.84% | 0.61% | 2.01% | 1 | 10 | | TPCH_NESTED(_20) | TPCH-Q1 | parquet / none / none | 5.98 | 6.12 | -2.30% | 1.84% | 3.19% | 1 | 10 | +------------------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+ Change-Id: I9d7608181eeacfe706a09c1e153d0a3e1ee9b475 Reviewed-on: http://gerrit.cloudera.org:8080/3842 Reviewed-by: Tim Armstrong <[email protected]> Reviewed-by: Marcel Kornacker <[email protected]> Tested-by: Internal Jenkins --- M be/src/exec/CMakeLists.txt M be/src/exec/blocking-join-node.cc M be/src/exec/blocking-join-node.h M be/src/exec/data-sink.cc M be/src/exec/data-sink.h M be/src/exec/hash-join-node.cc M be/src/exec/hash-join-node.h M be/src/exec/hbase-table-sink.cc M be/src/exec/hbase-table-sink.h M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/exec/kudu-table-sink-test.cc M be/src/exec/kudu-table-sink.cc M be/src/exec/kudu-table-sink.h A be/src/exec/nested-loop-join-builder.cc A be/src/exec/nested-loop-join-builder.h M be/src/exec/nested-loop-join-node.cc M be/src/exec/nested-loop-join-node.h M be/src/exec/partitioned-hash-join-node.cc M be/src/exec/partitioned-hash-join-node.h M be/src/exec/row-batch-cache.h M be/src/runtime/data-stream-sender.cc M be/src/runtime/data-stream-sender.h M be/src/runtime/data-stream-test.cc M be/src/runtime/plan-fragment-executor.cc M be/src/runtime/plan-fragment-executor.h M be/src/runtime/row-batch.cc M be/src/runtime/row-batch.h M be/src/util/stopwatch.h 29 files changed, 718 insertions(+), 461 deletions(-) Approvals: Marcel Kornacker: Looks good to me, approved Internal Jenkins: Verified Tim Armstrong: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/3842 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: I9d7608181eeacfe706a09c1e153d0a3e1ee9b475 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Marcel Kornacker <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]>
