[ 
https://issues.apache.org/jira/browse/HIVE-29473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Bereznyakov updated HIVE-29473:
------------------------------------------
    Description: 
*Symptom:* When a query contains a {{LATERAL VIEW}}, the Cost-Based Optimizer 
(CBO) can produce inaccurate cardinality and data size estimates for downstream 
operators (such as {{Group By}}). The error typically shows up as inflated row 
counts and data sizes, which can lead to suboptimal execution plans, poor join 
strategy selection, and inefficient resource allocation during query execution.

*Root Cause:* The bug is in {{LateralViewJoinStatsRule.process()}}. The rule 
passes the same {{columnExprMap}} and the full {{RowSchema}} when it collects 
column statistics for each of the two parent branches (the {{SELECT}} branch 
and the UDTF branch).

Because the UDTF branch is compiled in isolation, its internal column name 
generator restarts at 0, so both branches carry columns named {{_col0}}, 
{{_col1}}, and so on. When the UDTF branch is evaluated, 
{{StatsUtils.getColStatisticsFromExprMap}} matches the UDTF's statistics 
against expressions that belong to the {{SELECT}} branch, because they share 
the same names. This namespace collision *causes the CBO to combine the 
statistics of completely unrelated columns* (e.g., a base table's string key 
and a UDTF's exploded array column) via *joinedStats.addToColumnStats()*. 
Because the underlying merge applies maximum-value semantics to overlapping 
keys, a generated UDTF column with a larger NDV or {{avgColLen}} silently 
overwrites the base table's true metrics, inflating the downstream cardinality 
and data size estimates.
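
The collision described above can be illustrated with a small standalone 
sketch. It is a simplified model and not Hive code: the {{ColStat}} class, the 
{{addWithMaxSemantics}} helper, and the numeric values are all assumptions 
made for illustration. It only shows how two unrelated columns that share the 
name {{_col0}} collapse into one entry under a maximum-value merge.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Simplified model of the collision; none of these types are Hive classes. */
final class StatsCollisionSketch {

    /** Hypothetical stand-in for per-column statistics. */
    static final class ColStat {
        final long ndv;
        final double avgColLen;

        ColStat(long ndv, double avgColLen) {
            this.ndv = ndv;
            this.avgColLen = avgColLen;
        }

        @Override
        public String toString() {
            return "ColStat{ndv=" + ndv + ", avgColLen=" + avgColLen + "}";
        }
    }

    /** Adds a column's stats; on a name clash, keeps the larger NDV and avgColLen. */
    static void addWithMaxSemantics(Map<String, ColStat> joined, String name, ColStat incoming) {
        joined.merge(name, incoming, (a, b) ->
                new ColStat(Math.max(a.ndv, b.ndv), Math.max(a.avgColLen, b.avgColLen)));
    }

    /** Builds joined stats from two branches that both name their first column _col0. */
    static Map<String, ColStat> joinCollidingBranches() {
        Map<String, ColStat> joined = new LinkedHashMap<>();
        // SELECT branch: a base table string key with few distinct values (made-up numbers).
        addWithMaxSemantics(joined, "_col0", new ColStat(10L, 4.0));
        // UDTF branch: an exploded array element, also named _col0 because numbering restarted.
        addWithMaxSemantics(joined, "_col0", new ColStat(5000L, 120.0));
        return joined;
    }

    public static void main(String[] args) {
        // The single surviving entry carries the UDTF column's larger values.
        System.out.println(joinCollidingBranches());
    }
}
```

In this model the base table key's own NDV and average length are lost, which 
is the kind of inflation the downstream operators then inherit.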

*Proposed Fix:* Enforce parent operator boundaries before mapping statistics. 
Slicing both the {{RowSchema}} and the {{columnExprMap}} into separate 
per-branch collections, based on the {{SELECT_TAG}} and {{UDTF_TAG}} 
boundaries, isolates the two namespaces. {{StatsUtils}} then evaluates only 
the expressions that belong to the branch being processed, which prevents the 
cross-branch collision.
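
One way to picture the slicing is the sketch below. It is not the actual 
patch: it assumes the join's output schema lists the {{SELECT}} branch's 
columns first and the UDTF branch's columns after them, models the schema as a 
list of names and the expression map as output name to parent column name, and 
defines its own local {{SELECT_TAG}} and {{UDTF_TAG}} constants. The helper 
names {{sliceSchema}} and {{sliceExprMap}} are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of per-branch slicing; none of these helpers are Hive APIs. */
final class BranchSliceSketch {
    // Local stand-ins for the branch tags (assumed values).
    static final int SELECT_TAG = 0;
    static final int UDTF_TAG = 1;

    /** Returns the output columns of one branch, assuming SELECT columns come first. */
    static List<String> sliceSchema(List<String> outputCols, int numSelectCols, int tag) {
        int from = (tag == SELECT_TAG) ? 0 : numSelectCols;
        int to = (tag == SELECT_TAG) ? numSelectCols : outputCols.size();
        return new ArrayList<>(outputCols.subList(from, to));
    }

    /** Keeps only the expression-map entries whose output column is in the branch slice. */
    static Map<String, String> sliceExprMap(Map<String, String> exprMap, List<String> branchCols) {
        Map<String, String> sliced = new LinkedHashMap<>();
        for (String col : branchCols) {
            String parentCol = exprMap.get(col);
            if (parentCol != null) {
                sliced.put(col, parentCol);
            }
        }
        return sliced;
    }

    public static void main(String[] args) {
        // Join output: two columns from the SELECT branch, one from the UDTF branch.
        List<String> outputCols = Arrays.asList("_col0", "_col1", "_col2");
        Map<String, String> exprMap = new LinkedHashMap<>();
        exprMap.put("_col0", "_col0"); // SELECT parent's _col0
        exprMap.put("_col1", "_col1"); // SELECT parent's _col1
        exprMap.put("_col2", "_col0"); // UDTF parent's _col0 (numbering restarted)

        List<String> udtfCols = sliceSchema(outputCols, 2, UDTF_TAG);
        // Only the UDTF's own entry is left to resolve against the UDTF parent's stats.
        System.out.println(sliceExprMap(exprMap, udtfCols));
    }
}
```

With the slices in place, a lookup against the UDTF parent never sees the 
{{SELECT}} branch's {{_col0}} or {{_col1}} entries, so the shared names can no 
longer be matched across branches.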

  was:
*Symptom:* When a query contains a {{LATERAL VIEW}} operation, the Cost-Based 
Optimizer (CBO) can generate inaccurate cardinality and data size estimates for 
downstream operators (such as {{Group By}}). This reduction in statistical 
accuracy—typically manifesting as artificially inflated row counts and data 
sizes—can lead to suboptimal execution plans, poor join strategy selections, 
and inefficient resource allocation during query execution.

*Root Cause:* The bug lies in {{LateralViewJoinStatsRule.process()}}. 
Specifically, the rule passes the same {{columnExprMap}} and full {{RowSchema}} 
to both branches.

Because the UDTF branch is compiled in isolation, its internal column generator 
restarts at 0. The bug manifests during the UDTF branch evaluation because the 
utility method ({{StatsUtils.getColStatisticsFromExprMap}}) incorrectly 
matches the UDTF's statistics against the {{SELECT}} branch's identical column 
names (e.g., {{_col0}}, {{_col1}}, etc.). This direct namespace 
collision *causes the CBO to combine the statistics of completely unrelated 
columns* (e.g., combining a base table's string key with a UDTF's exploded 
array column). Because the underlying merge algorithm applies maximum-value 
semantics to overlapping keys, a generated UDTF column with a larger NDV or 
{{avgColLen}} will silently overwrite the base table's true metrics, 
artificially inflating the downstream cardinality and data size estimates.

*Proposed Fix:* Enforce strict parent operator boundaries before mapping 
statistics. By slicing both the {{RowSchema}} and the {{columnExprMap}} into 
isolated collections based on the {{SELECT_TAG}} and {{UDTF_TAG}} boundaries, 
we establish strict namespace isolation. {{StatsUtils}} will now only evaluate 
expressions that mathematically belong to that specific branch, preventing the 
cross-branch namespace collision entirely.


> CBO: LateralViewJoinStatsRule unintentionally combines base table and UDTF 
> column stats
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-29473
>                 URL: https://issues.apache.org/jira/browse/HIVE-29473
>             Project: Hive
>          Issue Type: Bug
>          Components: CBO
>            Reporter: Konstantin Bereznyakov
>            Assignee: Konstantin Bereznyakov
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: FIXED lateral_view_nested_stats_bug.q.out, 
> lateral_view_nested_stats_bug.q, lateral_view_nested_stats_bug.q.out
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
