[jira] [Updated] (HIVE-29613) Cross-product join falls back to single-reducer shuffle merge when small-side row estimate marginally exceeds hive.xprod.mapjoin.small.table.rows

Araika Singh (Jira) Thu, 14 May 2026 04:55:55 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-29613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Araika Singh updated HIVE-29613:
--------------------------------
    Description: 
*Background:*

{color:#4c9aff}ConvertJoinMapJoin{color} decides whether to convert each join 
into a broadcast map join or leave it as a shuffle merge join. For joins the 
optimizer classifies as cross products ({color:#4c9aff}isCrossProduct(joinOp) 
== true{color}), an additional row-count gate is applied: the build side's 
estimated row count must not exceed 
{color:#4c9aff}hive.xprod.mapjoin.small.table.rows{color} (default 1). When the 
estimate exceeds this threshold, the cross-product map-join conversion is 
rejected outright and the join falls back through 
{color:#4c9aff}fallbackToReduceSideJoin / fallbackToMergeJoin{color} into a 
shuffle merge join.

In practice, the build side of a tiny lookup-style table can end up estimated 
at 2 or 3 rows after CBO pushes down predicates and applies NDV-based 
selectivity, even when its actual byte footprint is a few hundred bytes — 
easily within the existing broadcast byte budget 
{color:#4c9aff}hive.auto.convert.join.noconditionaltask.size{color}. NDV-driven 
filter estimates routinely overshoot the row threshold by a small margin on 
tiny tables. When this happens the row-only gate rejects a broadcast that would 
have been entirely safe by the same byte budget that map-join conversion 
already uses elsewhere.

The resulting plan is a {color:#4c9aff}MERGEJOIN{color} with 
{color:#4c9aff}XPROD_EDGE{color} inputs. Because a keyless cross-product offers 
nothing real to shuffle on, the shuffle collapses to a single reduce key, and 
the entire output of the big side is materialised on a single reducer task. The 
downstream symptom on real-world data is a job pinned indefinitely at ~98–99% 
with one reducer running while the rest finish. We have observed this on a join 
of a ~29 Million-row, ~9 GB big-side table against a build side filtered down 
to 2 estimated rows / a few hundred bytes {color:#4c9aff}onlineDataSize{color} 
— the small side fits the byte budget by ~7 orders of magnitude, yet the 
row-only gate rejects the broadcast because 2 > 1:
{code:java}
ConvertJoinMapJoin#getMapJoinConversion

if (parentStats.getNumRows() >
       HiveConf.getIntVar(context.conf, 
HiveConf.ConfVars.XPRODSMALLTABLEROWSTHRESHOLD)) {
     // if any of smaller side is estimated to generate more than
     // threshold rows we would disable mapjoin
     return null;
   } {code}
At compile time the optimizer emits, three times in a row, the two log lines:
{code:java}
INFO  optimizer.ConvertJoinMapJoin - Could not get a valid join position. 
Defaulting to position 0
INFO  optimizer.ConvertJoinMapJoin - Fallback to common merge join operator
....
INFO  SessionState - Warning: Shuffle Join MERGEJOIN[141] in Stage 'Reducer 2' 
is a cross product
{code}
The root cause is that the cross-product gate consults only the row count of 
the build side, not its byte footprint, and NDV-driven row estimates on tiny 
tables are exactly the case where the byte footprint is small but the row count 
overshoots the threshold by 1.

*Proposed Fix:*

Relax the cross-product gate — when the row estimate exceeds 
{color:#4c9aff}hive.xprod.mapjoin.small.table.rows{color}, consult a byte check 
against {color:#4c9aff}hive.auto.convert.join.noconditionaltask.size{color} 
(the same budget map-join conversion already uses). If the small side's 
computeOnlineDataSize fits the budget, allow the conversion. If it doesn't, 
reject as before.

The log line is emitted at compile time when the byte-fallback branch admits, 
so the new path is observable in HS2 logs:
{code:java}
INFO  SessionState - Warning: Shuffle Join MERGEJOIN[141] in Stage 'Reducer 2' 
is a cross product{code}
*Risk and scope:*
 - The change is confined to the cross-product branch of 
{color:#4c9aff}ConvertJoinMapJoin.getMapJoinConversion{color}. 
Non-cross-product joins are unaffected.
 - For cross products whose build side is genuinely large in bytes, the new 
check rejects with the same outcome as today.
 - The change is gated by 
{color:#4c9aff}hive.tez.cartesian-product.enabled{color} (the existing flag) — 
clusters that have cartesian-product edges disabled don't enter this branch at 
all.
 - No public API or configuration surface is added.

  was:
*Background:*

{color:#4c9aff}ConvertJoinMapJoin{color} decides whether to convert each join 
into a broadcast map join or leave it as a shuffle merge join. For joins the 
optimizer classifies as cross products ({color:#4c9aff}isCrossProduct(joinOp) 
== true{color}), an additional row-count gate is applied: the build side's 
estimated row count must not exceed 
{color:#4c9aff}hive.xprod.mapjoin.small.table.rows{color} (default 1). When the 
estimate exceeds this threshold, the cross-product map-join conversion is 
rejected outright and the join falls back through 
{color:#4c9aff}fallbackToReduceSideJoin / fallbackToMergeJoin{color} into a 
shuffle merge join.

In practice, the build side of a tiny lookup-style table can end up estimated 
at 2 or 3 rows after CBO pushes down predicates and applies NDV-based 
selectivity, even when its actual byte footprint is a few hundred bytes — 
easily within the existing broadcast byte budget 
{color:#4c9aff}hive.auto.convert.join.noconditionaltask.size{color}. NDV-driven 
filter estimates routinely overshoot the row threshold by a small margin on 
tiny tables. When this happens the row-only gate rejects a broadcast that would 
have been entirely safe by the same byte budget that map-join conversion 
already uses elsewhere.

The resulting plan is a {color:#4c9aff}MERGEJOIN{color} with 
{color:#4c9aff}XPROD_EDGE{color} inputs. Because a keyless cross-product offers 
nothing real to shuffle on, the shuffle collapses to a single reduce key, and 
the entire output of the big side is materialised on a single reducer task. The 
downstream symptom on real-world data is a job pinned indefinitely at ~98–99% 
with one reducer running while the rest finish. We have observed this on a join 
of a ~29 Million-row, ~9 GB big-side table against a build side filtered down 
to 2 estimated rows / a few hundred bytes {color:#4c9aff}onlineDataSize{color} 
— the small side fits the byte budget by ~7 orders of magnitude, yet the 
row-only gate rejects the broadcast because 2 > 1:
{code:java}
ConvertJoinMapJoin#getMapJoinConversion

if (parentStats.getNumRows() >
       HiveConf.getIntVar(context.conf, 
HiveConf.ConfVars.XPRODSMALLTABLEROWSTHRESHOLD)) {
     // if any of smaller side is estimated to generate more than
     // threshold rows we would disable mapjoin
     return null;
   } {code}
 

At compile time the optimizer emits, three times in a row, the two log lines:
{code:java}
INFO  optimizer.ConvertJoinMapJoin - Could not get a valid join position. 
Defaulting to position 0
INFO  optimizer.ConvertJoinMapJoin - Fallback to common merge join operator
....
INFO  SessionState - Warning: Shuffle Join MERGEJOIN[141] in Stage 'Reducer 2' 
is a cross product
{code}
The root cause is that the cross-product gate consults only the row count of 
the build side, not its byte footprint, and NDV-driven row estimates on tiny 
tables are exactly the case where the byte footprint is small but the row count 
overshoots the threshold by 1.

*Proposed Fix:*

Relax the cross-product gate — when the row estimate exceeds 
{color:#4c9aff}hive.xprod.mapjoin.small.table.rows{color}, consult a byte check 
against {color:#4c9aff}hive.auto.convert.join.noconditionaltask.size{color} 
(the same budget map-join conversion already uses). If the small side's 
computeOnlineDataSize fits the budget, allow the conversion. If it doesn't, 
reject as before.

The log line is emitted at compile time when the byte-fallback branch admits, 
so the new path is observable in HS2 logs:
{code:java}
INFO  SessionState - Warning: Shuffle Join MERGEJOIN[141] in Stage 'Reducer 2' 
is a cross product{code}
*Risk and scope:*

- The change is confined to the cross-product branch of 
{color:#4c9aff}ConvertJoinMapJoin.getMapJoinConversion{color}. 
Non-cross-product joins are unaffected.
- For cross products whose build side is genuinely large in bytes, the new 
check rejects with the same outcome as today.
- The change is gated by 
{color:#4c9aff}hive.tez.cartesian-product.enabled{color} (the existing flag) — 
clusters that have cartesian-product edges disabled don't enter this branch at 
all.
- No public API or configuration surface is added.


> Cross-product join falls back to single-reducer shuffle merge when small-side 
> row estimate marginally exceeds hive.xprod.mapjoin.small.table.rows
> -------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-29613
>                 URL: https://issues.apache.org/jira/browse/HIVE-29613
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Araika Singh
>            Assignee: Araika Singh
>            Priority: Major
>
> *Background:*
> {color:#4c9aff}ConvertJoinMapJoin{color} decides whether to convert each join 
> into a broadcast map join or leave it as a shuffle merge join. For joins the 
> optimizer classifies as cross products ({color:#4c9aff}isCrossProduct(joinOp) 
> == true{color}), an additional row-count gate is applied: the build side's 
> estimated row count must not exceed 
> {color:#4c9aff}hive.xprod.mapjoin.small.table.rows{color} (default 1). When 
> the estimate exceeds this threshold, the cross-product map-join conversion is 
> rejected outright and the join falls back through 
> {color:#4c9aff}fallbackToReduceSideJoin / fallbackToMergeJoin{color} into a 
> shuffle merge join.
> In practice, the build side of a tiny lookup-style table can end up estimated 
> at 2 or 3 rows after CBO pushes down predicates and applies NDV-based 
> selectivity, even when its actual byte footprint is a few hundred bytes — 
> easily within the existing broadcast byte budget 
> {color:#4c9aff}hive.auto.convert.join.noconditionaltask.size{color}. 
> NDV-driven filter estimates routinely overshoot the row threshold by a small 
> margin on tiny tables. When this happens the row-only gate rejects a 
> broadcast that would have been entirely safe by the same byte budget that 
> map-join conversion already uses elsewhere.
> The resulting plan is a {color:#4c9aff}MERGEJOIN{color} with 
> {color:#4c9aff}XPROD_EDGE{color} inputs. Because a keyless cross-product 
> offers nothing real to shuffle on, the shuffle collapses to a single reduce 
> key, and the entire output of the big side is materialised on a single 
> reducer task. The downstream symptom on real-world data is a job pinned 
> indefinitely at ~98–99% with one reducer running while the rest finish. We 
> have observed this on a join of a ~29 Million-row, ~9 GB big-side table 
> against a build side filtered down to 2 estimated rows / a few hundred bytes 
> {color:#4c9aff}onlineDataSize{color} — the small side fits the byte budget by 
> ~7 orders of magnitude, yet the row-only gate rejects the broadcast because 2 
> > 1:
> {code:java}
> ConvertJoinMapJoin#getMapJoinConversion
> if (parentStats.getNumRows() >
>        HiveConf.getIntVar(context.conf, 
> HiveConf.ConfVars.XPRODSMALLTABLEROWSTHRESHOLD)) {
>      // if any of smaller side is estimated to generate more than
>      // threshold rows we would disable mapjoin
>      return null;
>    } {code}
> At compile time the optimizer emits, three times in a row, the two log lines:
> {code:java}
> INFO  optimizer.ConvertJoinMapJoin - Could not get a valid join position. 
> Defaulting to position 0
> INFO  optimizer.ConvertJoinMapJoin - Fallback to common merge join operator
> ....
> INFO  SessionState - Warning: Shuffle Join MERGEJOIN[141] in Stage 'Reducer 
> 2' is a cross product
> {code}
> The root cause is that the cross-product gate consults only the row count of 
> the build side, not its byte footprint, and NDV-driven row estimates on tiny 
> tables are exactly the case where the byte footprint is small but the row 
> count overshoots the threshold by 1.
> *Proposed Fix:*
> Relax the cross-product gate — when the row estimate exceeds 
> {color:#4c9aff}hive.xprod.mapjoin.small.table.rows{color}, consult a byte 
> check against 
> {color:#4c9aff}hive.auto.convert.join.noconditionaltask.size{color} (the same 
> budget map-join conversion already uses). If the small side's 
> computeOnlineDataSize fits the budget, allow the conversion. If it doesn't, 
> reject as before.
> The log line is emitted at compile time when the byte-fallback branch admits, 
> so the new path is observable in HS2 logs:
> {code:java}
> INFO  SessionState - Warning: Shuffle Join MERGEJOIN[141] in Stage 'Reducer 
> 2' is a cross product{code}
> *Risk and scope:*
>  - The change is confined to the cross-product branch of 
> {color:#4c9aff}ConvertJoinMapJoin.getMapJoinConversion{color}. 
> Non-cross-product joins are unaffected.
>  - For cross products whose build side is genuinely large in bytes, the new 
> check rejects with the same outcome as today.
>  - The change is gated by 
> {color:#4c9aff}hive.tez.cartesian-product.enabled{color} (the existing flag) 
> — clusters that have cartesian-product edges disabled don't enter this branch 
> at all.
>  - No public API or configuration surface is added.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (HIVE-29613) Cross-product join falls back to single-reducer shuffle merge when small-side row estimate marginally exceeds hive.xprod.mapjoin.small.table.rows

Reply via email to