deniskuzZ commented on PR #6538:
URL: https://github.com/apache/hive/pull/6538#issuecomment-4708113727
running with set hive.transpose.aggr.join=true (default is false).
It enables Hive's Calcite rule HiveAggregateJoinTransposeRule, which pushes
an aggregation below a join — i.e., aggregate first, then join, instead of join
then aggregate.
Concrete example (q4): the query joins store_sales/catalog_sales/web_sales
to customer and then does SUM(...) GROUP BY c_customer_id, ….
- transpose-OFF: join the full fact rows to customer first (~539M rows),
then aggregate.
- transpose-ON: push the per-channel SUM below the customer join, so the
wide GROUP BY collapses ~539M → a few million rows before the join. The plan
signature: the customer side picks up a count() and the join emits CAST(sum *
count AS decimal).
TPCDS results:
- Wins: q2 (139→20), q4, q22, q71, q86, q59 — when the pre-aggregation
genuinely shrinks the join input.
- Regressions (~14): q47 (89→25 OFF-is-better), q78, q98, q25, q51, q87… —
when pushing the agg down doesn't shrink things but adds a wide-key shuffle.
The root cause of the regressions: Hive's default cost model is
cardinality-only (cpu=io=0), so the rule fires on tiny rowcount differences and
can't tell a beneficial transpose from a harmful one. That's why "best-of
per-query" beats turning it globally ON or OFF.
Turning on `hive.cbo.costmodel.extended` makes cpu/io non-zero everywhere,
which changes the entire CBO, not just the transpose rule — join ordering, join
algorithm selection, etc. On the 5-query cherry-picked subset
(q4/q14/q47/q78/q98) it won (458 vs 471 OFF vs 574 ON) because those were
hand-picked transpose regressors.
But across all 99, the new regressions it introduces elsewhere outweigh the
transpose decisions it fixes — most visibly it regressed q4 via join-order
(~88→105s), nothing to do with transpose. Broad blast radius = net loss.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]