[
https://issues.apache.org/jira/browse/HIVE-29613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Araika Singh updated HIVE-29613:
--------------------------------
Description:
*Background:*
{color:#4c9aff}ConvertJoinMapJoin{color} decides whether to convert each join
into a broadcast map join or leave it as a shuffle merge join. For joins the
optimizer classifies as cross products ({color:#4c9aff}isCrossProduct(joinOp)
== true{color}), an additional row-count gate is applied: the build side's
estimated row count must not exceed
{color:#4c9aff}hive.xprod.mapjoin.small.table.rows{color} (default 1). When the
estimate exceeds this threshold, the cross-product map-join conversion is
rejected outright and the join falls back through
{color:#4c9aff}fallbackToReduceSideJoin / fallbackToMergeJoin{color} into a
shuffle merge join.
In practice, the build side of a tiny lookup-style table can end up estimated
at 2 or 3 rows after CBO pushes down predicates and applies NDV-based
selectivity, even when its actual byte footprint is a few hundred bytes —
easily within the existing broadcast byte budget
{color:#4c9aff}hive.auto.convert.join.noconditionaltask.size{color}. NDV-driven
filter estimates routinely overshoot the row threshold by a small margin on
tiny tables. When this happens the row-only gate rejects a broadcast that would
have been entirely safe by the same byte budget that map-join conversion
already uses elsewhere.
The resulting plan is a {color:#4c9aff}MERGEJOIN{color} with
{color:#4c9aff}XPROD_EDGE{color} inputs. Because a keyless cross-product offers
nothing real to shuffle on, the shuffle collapses to a single reduce key, and
the entire output of the big side is materialised on a single reducer task. The
downstream symptom on real-world data is a job pinned indefinitely at ~98–99%
with one reducer running while the rest finish. We have observed this on a join
of a ~29 Million-row, ~9 GB big-side table against a build side filtered down
to 2 estimated rows / a few hundred bytes {color:#4c9aff}onlineDataSize{color}
— the small side fits the byte budget by ~7 orders of magnitude, yet the
row-only gate rejects the broadcast because 2 > 1:
{code:java}
ConvertJoinMapJoin#getMapJoinConversion
if (parentStats.getNumRows() >
HiveConf.getIntVar(context.conf,
HiveConf.ConfVars.XPRODSMALLTABLEROWSTHRESHOLD)) {
// if any of smaller side is estimated to generate more than
// threshold rows we would disable mapjoin
return null;
} {code}
At compile time the optimizer emits, three times in a row, the two log lines:
{code:java}
INFO optimizer.ConvertJoinMapJoin - Could not get a valid join position.
Defaulting to position 0
INFO optimizer.ConvertJoinMapJoin - Fallback to common merge join operator
....
INFO SessionState - Warning: Shuffle Join MERGEJOIN[141] in Stage 'Reducer 2'
is a cross product
{code}
The root cause is that the cross-product gate consults only the row count of
the build side, not its byte footprint, and NDV-driven row estimates on tiny
tables are exactly the case where the byte footprint is small but the row count
overshoots the threshold by 1.
*Proposed Fix:*
Relax the cross-product gate — when the row estimate exceeds
{color:#4c9aff}hive.xprod.mapjoin.small.table.rows{color}, consult a byte check
against {color:#4c9aff}hive.auto.convert.join.noconditionaltask.size{color}
(the same budget map-join conversion already uses). If the small side's
computeOnlineDataSize fits the budget, allow the conversion. If it doesn't,
reject as before.
The log line is emitted at compile time when the byte-fallback branch admits,
so the new path is observable in HS2 logs:
{code:java}
INFO SessionState - Warning: Shuffle Join MERGEJOIN[141] in Stage 'Reducer 2'
is a cross product{code}
*Risk and scope:*
- The change is confined to the cross-product branch of
{color:#4c9aff}ConvertJoinMapJoin.getMapJoinConversion{color}.
Non-cross-product joins are unaffected.
- For cross products whose build side is genuinely large in bytes, the new
check rejects with the same outcome as today.
- The change is gated by
{color:#4c9aff}hive.tez.cartesian-product.enabled{color} (the existing flag) —
clusters that have cartesian-product edges disabled don't enter this branch at
all.
- No public API or configuration surface is added.
was:
*Background:*
{color:#4c9aff}ConvertJoinMapJoin{color} decides whether to convert each join
into a broadcast map join or leave it as a shuffle merge join. For joins the
optimizer classifies as cross products ({color:#4c9aff}isCrossProduct(joinOp)
== true{color}), an additional row-count gate is applied: the build side's
estimated row count must not exceed
{color:#4c9aff}hive.xprod.mapjoin.small.table.rows{color} (default 1). When the
estimate exceeds this threshold, the cross-product map-join conversion is
rejected outright and the join falls back through
{color:#4c9aff}fallbackToReduceSideJoin / fallbackToMergeJoin{color} into a
shuffle merge join.
In practice, the build side of a tiny lookup-style table can end up estimated
at 2 or 3 rows after CBO pushes down predicates and applies NDV-based
selectivity, even when its actual byte footprint is a few hundred bytes —
easily within the existing broadcast byte budget
{color:#4c9aff}hive.auto.convert.join.noconditionaltask.size{color}. NDV-driven
filter estimates routinely overshoot the row threshold by a small margin on
tiny tables. When this happens the row-only gate rejects a broadcast that would
have been entirely safe by the same byte budget that map-join conversion
already uses elsewhere.
The resulting plan is a {color:#4c9aff}MERGEJOIN{color} with
{color:#4c9aff}XPROD_EDGE{color} inputs. Because a keyless cross-product offers
nothing real to shuffle on, the shuffle collapses to a single reduce key, and
the entire output of the big side is materialised on a single reducer task. The
downstream symptom on real-world data is a job pinned indefinitely at ~98–99%
with one reducer running while the rest finish. We have observed this on a join
of a ~29 Million-row, ~9 GB big-side table against a build side filtered down
to 2 estimated rows / a few hundred bytes {color:#4c9aff}onlineDataSize{color}
— the small side fits the byte budget by ~7 orders of magnitude, yet the
row-only gate rejects the broadcast because 2 > 1:
{code:java}
ConvertJoinMapJoin#getMapJoinConversion
if (parentStats.getNumRows() >
HiveConf.getIntVar(context.conf,
HiveConf.ConfVars.XPRODSMALLTABLEROWSTHRESHOLD)) {
// if any of smaller side is estimated to generate more than
// threshold rows we would disable mapjoin
return null;
} {code}
At compile time the optimizer emits, three times in a row, the two log lines:
{code:java}
INFO optimizer.ConvertJoinMapJoin - Could not get a valid join position.
Defaulting to position 0
INFO optimizer.ConvertJoinMapJoin - Fallback to common merge join operator
....
INFO SessionState - Warning: Shuffle Join MERGEJOIN[141] in Stage 'Reducer 2'
is a cross product
{code}
The root cause is that the cross-product gate consults only the row count of
the build side, not its byte footprint, and NDV-driven row estimates on tiny
tables are exactly the case where the byte footprint is small but the row count
overshoots the threshold by 1.
*Proposed Fix:*
Relax the cross-product gate — when the row estimate exceeds
{color:#4c9aff}hive.xprod.mapjoin.small.table.rows{color}, consult a byte check
against {color:#4c9aff}hive.auto.convert.join.noconditionaltask.size{color}
(the same budget map-join conversion already uses). If the small side's
computeOnlineDataSize fits the budget, allow the conversion. If it doesn't,
reject as before.
The log line is emitted at compile time when the byte-fallback branch admits,
so the new path is observable in HS2 logs:
{code:java}
INFO SessionState - Warning: Shuffle Join MERGEJOIN[141] in Stage 'Reducer 2'
is a cross product{code}
*Risk and scope:*
- The change is confined to the cross-product branch of
{color:#4c9aff}ConvertJoinMapJoin.getMapJoinConversion{color}.
Non-cross-product joins are unaffected.
- For cross products whose build side is genuinely large in bytes, the new
check rejects with the same outcome as today.
- The change is gated by
{color:#4c9aff}hive.tez.cartesian-product.enabled{color} (the existing flag) —
clusters that have cartesian-product edges disabled don't enter this branch at
all.
- No public API or configuration surface is added.
> Cross-product join falls back to single-reducer shuffle merge when small-side
> row estimate marginally exceeds hive.xprod.mapjoin.small.table.rows
> -------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-29613
> URL: https://issues.apache.org/jira/browse/HIVE-29613
> Project: Hive
> Issue Type: Bug
> Reporter: Araika Singh
> Assignee: Araika Singh
> Priority: Major
>
> *Background:*
> {color:#4c9aff}ConvertJoinMapJoin{color} decides whether to convert each join
> into a broadcast map join or leave it as a shuffle merge join. For joins the
> optimizer classifies as cross products ({color:#4c9aff}isCrossProduct(joinOp)
> == true{color}), an additional row-count gate is applied: the build side's
> estimated row count must not exceed
> {color:#4c9aff}hive.xprod.mapjoin.small.table.rows{color} (default 1). When
> the estimate exceeds this threshold, the cross-product map-join conversion is
> rejected outright and the join falls back through
> {color:#4c9aff}fallbackToReduceSideJoin / fallbackToMergeJoin{color} into a
> shuffle merge join.
> In practice, the build side of a tiny lookup-style table can end up estimated
> at 2 or 3 rows after CBO pushes down predicates and applies NDV-based
> selectivity, even when its actual byte footprint is a few hundred bytes —
> easily within the existing broadcast byte budget
> {color:#4c9aff}hive.auto.convert.join.noconditionaltask.size{color}.
> NDV-driven filter estimates routinely overshoot the row threshold by a small
> margin on tiny tables. When this happens the row-only gate rejects a
> broadcast that would have been entirely safe by the same byte budget that
> map-join conversion already uses elsewhere.
> The resulting plan is a {color:#4c9aff}MERGEJOIN{color} with
> {color:#4c9aff}XPROD_EDGE{color} inputs. Because a keyless cross-product
> offers nothing real to shuffle on, the shuffle collapses to a single reduce
> key, and the entire output of the big side is materialised on a single
> reducer task. The downstream symptom on real-world data is a job pinned
> indefinitely at ~98–99% with one reducer running while the rest finish. We
> have observed this on a join of a ~29 Million-row, ~9 GB big-side table
> against a build side filtered down to 2 estimated rows / a few hundred bytes
> {color:#4c9aff}onlineDataSize{color} — the small side fits the byte budget by
> ~7 orders of magnitude, yet the row-only gate rejects the broadcast because 2
> > 1:
> {code:java}
> ConvertJoinMapJoin#getMapJoinConversion
> if (parentStats.getNumRows() >
> HiveConf.getIntVar(context.conf,
> HiveConf.ConfVars.XPRODSMALLTABLEROWSTHRESHOLD)) {
> // if any of smaller side is estimated to generate more than
> // threshold rows we would disable mapjoin
> return null;
> } {code}
> At compile time the optimizer emits, three times in a row, the two log lines:
> {code:java}
> INFO optimizer.ConvertJoinMapJoin - Could not get a valid join position.
> Defaulting to position 0
> INFO optimizer.ConvertJoinMapJoin - Fallback to common merge join operator
> ....
> INFO SessionState - Warning: Shuffle Join MERGEJOIN[141] in Stage 'Reducer
> 2' is a cross product
> {code}
> The root cause is that the cross-product gate consults only the row count of
> the build side, not its byte footprint, and NDV-driven row estimates on tiny
> tables are exactly the case where the byte footprint is small but the row
> count overshoots the threshold by 1.
> *Proposed Fix:*
> Relax the cross-product gate — when the row estimate exceeds
> {color:#4c9aff}hive.xprod.mapjoin.small.table.rows{color}, consult a byte
> check against
> {color:#4c9aff}hive.auto.convert.join.noconditionaltask.size{color} (the same
> budget map-join conversion already uses). If the small side's
> computeOnlineDataSize fits the budget, allow the conversion. If it doesn't,
> reject as before.
> The log line is emitted at compile time when the byte-fallback branch admits,
> so the new path is observable in HS2 logs:
> {code:java}
> INFO SessionState - Warning: Shuffle Join MERGEJOIN[141] in Stage 'Reducer
> 2' is a cross product{code}
> *Risk and scope:*
> - The change is confined to the cross-product branch of
> {color:#4c9aff}ConvertJoinMapJoin.getMapJoinConversion{color}.
> Non-cross-product joins are unaffected.
> - For cross products whose build side is genuinely large in bytes, the new
> check rejects with the same outcome as today.
> - The change is gated by
> {color:#4c9aff}hive.tez.cartesian-product.enabled{color} (the existing flag)
> — clusters that have cartesian-product edges disabled don't enter this branch
> at all.
> - No public API or configuration surface is added.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)