Hello Csaba Ringhofer, Bikramjeet Vig, Impala Public Jenkins, I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/15096 to look at the new patch set (#17). Change subject: IMPALA-9156: share broadcast join builds ...................................................................... IMPALA-9156: share broadcast join builds The scheduler will only create one join build finstance per backend in cases where this is supported. The builder is aware of the number of finstances executing the probe and hands off the build data structures to the builders. Nested loop join requires minimal modifications because the build data structures are read-only after initial construction. The only significant change is that memory can't be transferred to the multiple consumers, so MarkNeedsDeepCopy() needs to be used instead. Hash join requires additional synchronisation because the spilling algorithm mutates build-side data structures. This patch adds synchronisation so that rebuilding spilled partitions is done in a thread-safe manner, using a single thread. This uses the CyclicBarrier added in an earlier patch. Threads blocked on CyclicBarrier need to be cancellable, which is handled by cancelling the barrier when cancelling fragments on the backend. BufferPool now correctly handles multiple threads calling CleanPages() concurrently, which makes other methods thread-safe. Update planner to cost broadcast join and estimate memory consumption based on a single instance per node. Planner estimates of number of instances are improved. Instead of assuming mt_dop instances per node, use the total number of input splits (also called scan ranges in places) as an upper bound on the number of instances generated by scans. These instance estimates from the scan nodes are then propagated up the plan tree in the same way as the numNodes estimates. The instance estimate for the join build fragment is fixed to be based on the destination fragment. The profile now correctly accounts for time waiting for the builder, counting it in inactive time and showing it in the node timeline. Additional improvements/cleanup to the time accounting are deferring until IMPALA-9422. Testing: * Updated planner tests * Ran a single node stress test with TPC-H and TPC-DS * Add a targeted test for spilling broadcast joins, both repartitioning and not repartitioning. * Add a targeted test for a spilling broadcast join with empty probe * Add a targeted test for spilling broadcast join with empty build partitions. * Add a broadcast join to test_cancellation and test_failpoints. Perf: I did a single node run on my desktop: +----------+-----------------------+---------+------------+------------+----------------+ | Workload | File Format | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) | +----------+-----------------------+---------+------------+------------+----------------+ | TPCH(30) | parquet / none / none | 6.26 | -15.70% | 4.63 | -16.16% | +----------+-----------------------+---------+------------+------------+----------------+ +----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+---------+ | Workload | Query | File Format | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval | +----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+---------+ | TPCH(30) | TPCH-Q21 | parquet / none / none | 24.97 | 23.25 | R +7.38% | 0.51% | 0.22% | 5 | R +6.95% | 2.31 | 27.93 | | TPCH(30) | TPCH-Q4 | parquet / none / none | 2.83 | 2.79 | +1.31% | 1.86% | 0.36% | 5 | +1.88% | 1.15 | 1.53 | | TPCH(30) | TPCH-Q6 | parquet / none / none | 1.28 | 1.28 | -0.01% | 1.64% | 1.63% | 5 | -0.11% | -0.58 | -0.01 | | TPCH(30) | TPCH-Q22 | parquet / none / none | 2.65 | 2.68 | -0.94% | 0.84% | 1.46% | 5 | -0.21% | -0.87 | -1.25 | | TPCH(30) | TPCH-Q1 | parquet / none / none | 4.69 | 4.72 | -0.56% | 1.29% | 0.52% | 5 | -1.04% | -1.15 | -0.89 | | TPCH(30) | TPCH-Q13 | parquet / none / none | 10.64 | 10.80 | -1.48% | 0.61% | 0.60% | 5 | -1.39% | -1.73 | -3.91 | | TPCH(30) | TPCH-Q15 | parquet / none / none | 4.11 | 4.32 | -4.92% | 0.05% | 0.40% | 5 | -4.93% | -2.31 | -27.46 | | TPCH(30) | TPCH-Q20 | parquet / none / none | 3.47 | 3.67 | I -5.41% | 0.81% | 0.03% | 5 | I -5.70% | -2.31 | -15.75 | | TPCH(30) | TPCH-Q17 | parquet / none / none | 7.58 | 8.14 | I -6.93% | 3.13% | 2.62% | 5 | I -9.31% | -2.02 | -3.96 | | TPCH(30) | TPCH-Q9 | parquet / none / none | 15.59 | 17.02 | I -8.38% | 0.95% | 0.43% | 5 | I -8.92% | -2.31 | -19.37 | | TPCH(30) | TPCH-Q14 | parquet / none / none | 2.90 | 3.25 | I -10.93% | 1.42% | 4.41% | 5 | I -10.28% | -2.31 | -5.33 | | TPCH(30) | TPCH-Q12 | parquet / none / none | 2.69 | 3.13 | I -14.31% | 4.50% | 1.40% | 5 | I -17.79% | -2.31 | -7.80 | | TPCH(30) | TPCH-Q16 | parquet / none / none | 2.50 | 3.03 | I -17.54% | 0.10% | 0.79% | 5 | I -20.55% | -2.31 | -49.31 | | TPCH(30) | TPCH-Q10 | parquet / none / none | 4.76 | 5.92 | I -19.52% | 0.78% | 0.33% | 5 | I -24.31% | -2.31 | -61.63 | | TPCH(30) | TPCH-Q2 | parquet / none / none | 2.56 | 3.33 | I -23.18% | 2.13% | 0.85% | 5 | I -30.39% | -2.31 | -28.14 | | TPCH(30) | TPCH-Q18 | parquet / none / none | 12.59 | 16.41 | I -23.26% | 1.73% | 0.90% | 5 | I -30.43% | -2.31 | -32.36 | | TPCH(30) | TPCH-Q11 | parquet / none / none | 1.83 | 2.41 | I -24.04% | 1.83% | 2.22% | 5 | I -30.48% | -2.31 | -20.54 | | TPCH(30) | TPCH-Q8 | parquet / none / none | 4.43 | 5.94 | I -25.33% | 0.96% | 0.54% | 5 | I -34.54% | -2.31 | -63.01 | | TPCH(30) | TPCH-Q5 | parquet / none / none | 3.81 | 5.37 | I -29.08% | 1.43% | 0.69% | 5 | I -41.47% | -2.31 | -53.11 | | TPCH(30) | TPCH-Q7 | parquet / none / none | 13.34 | 21.49 | I -37.92% | 0.46% | 0.30% | 5 | I -60.69% | -2.31 | -203.08 | | TPCH(30) | TPCH-Q3 | parquet / none / none | 4.73 | 7.73 | I -38.81% | 4.90% | 1.35% | 5 | I -61.68% | -2.31 | -26.40 | | TPCH(30) | TPCH-Q19 | parquet / none / none | 3.71 | 6.61 | I -43.83% | 1.63% | 0.09% | 5 | I -77.12% | -2.31 | -106.61 | +----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+---------+ Change-Id: I4c67e4b2c87ed0fba648f1e1710addb885d66dc7 --- M be/src/exec/blocking-join-node.cc M be/src/exec/join-builder.cc M be/src/exec/join-builder.h M be/src/exec/join-op.h M be/src/exec/nested-loop-join-node.cc M be/src/exec/partitioned-hash-join-builder.cc M be/src/exec/partitioned-hash-join-builder.h M be/src/exec/partitioned-hash-join-node.cc M be/src/exec/partitioned-hash-join-node.h M be/src/runtime/bufferpool/buffer-pool-internal.h M be/src/runtime/bufferpool/buffer-pool.cc M be/src/runtime/bufferpool/buffer-pool.h M be/src/runtime/coordinator-backend-state.cc M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h M be/src/scheduling/query-schedule.h M be/src/scheduling/scheduler.cc M be/src/util/cyclic-barrier-test.cc M be/src/util/cyclic-barrier.h M common/thrift/DataSinks.thrift M common/thrift/ImpalaInternalService.thrift M fe/src/main/java/org/apache/impala/planner/AggregationNode.java M fe/src/main/java/org/apache/impala/planner/DataSourceScanNode.java M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java M fe/src/main/java/org/apache/impala/planner/EmptySetNode.java M fe/src/main/java/org/apache/impala/planner/ExchangeNode.java M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java M fe/src/main/java/org/apache/impala/planner/JoinBuildSink.java M fe/src/main/java/org/apache/impala/planner/JoinNode.java M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java M fe/src/main/java/org/apache/impala/planner/ParallelPlanner.java M fe/src/main/java/org/apache/impala/planner/PlanFragment.java M fe/src/main/java/org/apache/impala/planner/PlanNode.java M fe/src/main/java/org/apache/impala/planner/SingularRowSrcNode.java M fe/src/main/java/org/apache/impala/planner/UnionNode.java M fe/src/main/java/org/apache/impala/planner/UnnestNode.java M testdata/workloads/functional-planner/queries/PlannerTest/kudu-selectivity.test M testdata/workloads/functional-planner/queries/PlannerTest/max-row-size.test M testdata/workloads/functional-planner/queries/PlannerTest/mem-limit-broadcast-join.test M testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation-hdfs-num-rows-est-enabled.test M testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test M testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds-all.test M testdata/workloads/functional-planner/queries/PlannerTest/tpch-all.test A testdata/workloads/functional-query/queries/QueryTest/spilling-broadcast-joins.test M tests/common/test_dimensions.py M tests/failure/test_failpoints.py M tests/query_test/test_cancellation.py M tests/query_test/test_spilling.py 53 files changed, 1,319 insertions(+), 773 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/96/15096/17 -- To view, visit http://gerrit.cloudera.org:8080/15096 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I4c67e4b2c87ed0fba648f1e1710addb885d66dc7 Gerrit-Change-Number: 15096 Gerrit-PatchSet: 17 Gerrit-Owner: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Bikramjeet Vig <bikramjeet....@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>