[ 
https://issues.apache.org/jira/browse/HIVE-29473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18059984#comment-18059984
 ] 

Konstantin Bereznyakov commented on HIVE-29473:
-----------------------------------------------

The integration test [^lateral_view_nested_stats_bug.q] and its results 
[^lateral_view_nested_stats_bug.q.out] highlight the problem:

{quote}
Group By Operator
  keys: KEY._col0 (type: string)
  ...
  Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE
{quote}
Why it is inaccurate: The estimate shows 6 rows because of a _col0 name 
collision in the columnExprMap. When getColStatisticsFromExprMap() looks up 
statistics for the grouping key (id), it resolves _col0 to f1's column 
expression instead of id's. Since f1 has NDV=6, the CBO incorrectly uses 6 as 
the grouping key's distinct value count, producing an estimate of 6 rows 
instead of the correct 2.

A query with "unknown" NDV (forced to 0):
{quote}
Group By Operator
  keys: KEY._col0 (type: string)
  ...
  Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE
{quote}
Why it is inaccurate: The estimate shows 3 rows because of the same _col0 name 
collision in the columnExprMap. When getColStatisticsFromExprMap() looks up 
statistics for the grouping key (id), it resolves _col0 to f1's column 
expression instead of id's. Since f1 has NDV=3, the CBO uses 3 as the grouping 
key's distinct value count. The explicit NDV=0 we set on id is never consulted, 
so the fallback logic is never triggered, producing an estimate of 3 rows 
instead of the correct 6.
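
The collision described above can be sketched roughly as follows. This is a simplified model, not Hive's actual code: {{ColumnRef}}, {{resolveByNameOnly}} and the plain maps are hypothetical stand-ins for the {{columnExprMap}} and the per-parent column statistics, and the NDV value mirrors the first example (f1 has NDV=6).

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ExprMapCollisionSketch {

    /** Hypothetical stand-in for one columnExprMap value: the parent it reads and the column in it. */
    static final class ColumnRef {
        final String parentTag;    // "SELECT" or "UDTF" (illustrative labels)
        final String parentColumn; // internal name inside that parent, e.g. "_col0"

        ColumnRef(String parentTag, String parentColumn) {
            this.parentTag = parentTag;
            this.parentColumn = parentColumn;
        }
    }

    /** Resolves every entry against ONE parent's NDV stats by name only, ignoring parentTag. */
    static Map<String, Long> resolveByNameOnly(Map<String, ColumnRef> exprMap,
                                               Map<String, Long> parentNdv) {
        Map<String, Long> resolved = new LinkedHashMap<>();
        for (Map.Entry<String, ColumnRef> entry : exprMap.entrySet()) {
            Long ndv = parentNdv.get(entry.getValue().parentColumn);
            if (ndv != null) {
                resolved.put(entry.getKey(), ndv);
            }
        }
        return resolved;
    }

    static Map<String, Long> demo() {
        // Global map: both branches call their first column "_col0".
        Map<String, ColumnRef> exprMap = new LinkedHashMap<>();
        exprMap.put("_col0", new ColumnRef("SELECT", "_col0")); // id
        exprMap.put("_col1", new ColumnRef("UDTF", "_col0"));   // f1

        // Stats of the UDTF branch only: its "_col0" is f1.
        Map<String, Long> udtfNdv = new LinkedHashMap<>();
        udtfNdv.put("_col0", 6L);

        // The whole map is evaluated against the UDTF stats, so id picks up f1's NDV.
        return resolveByNameOnly(exprMap, udtfNdv);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints {_col0=6, _col1=6}
    }
}
```

In this model the grouping key's output column receives NDV=6 purely because the name {{_col0}} matches, which is the behaviour the plans above exhibit.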


> LateralViewJoinStatsRule combines stats of unrelated columns on 2+ LV 
> queries, corrupting CBO estimates
> -------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-29473
>                 URL: https://issues.apache.org/jira/browse/HIVE-29473
>             Project: Hive
>          Issue Type: Bug
>          Components: CBO
>            Reporter: Konstantin Bereznyakov
>            Assignee: Konstantin Bereznyakov
>            Priority: Major
>         Attachments: lateral_view_nested_stats_bug.q, 
> lateral_view_nested_stats_bug.q.out
>
>
> *Symptom:* When a query contains multiple {{LATERAL VIEW}}s (such as 
> nested {{posexplode}}s), the CBO can severely underestimate cardinality 
> for downstream operators (like {{Group By}}). This loss of 
> statistical accuracy leads to suboptimal execution plans, poor join choices, 
> and potential resource starvation during execution.
> *Root Cause:* The bug lies in {{LateralViewJoinStatsRule.process()}}. 
> When merging statistics, the rule passes the global {{columnExprMap}} to 
> {{StatsUtils.getColStatisticsFromExprMap}} to evaluate the UDTF branch.
> Because the UDTF branch is built in isolation, its internal column generator 
> restarts at 0, producing names like {{_col0}} and {{_col1}}. This creates 
> a namespace collision with the base table's internal columns (which also use 
> {{_col0}}, etc.). The utility method blindly matches these keys, which 
> *causes the CBO to combine the statistics of completely unrelated columns* 
> (e.g., merging the base table's {{id}} column with the UDTF's exploded array 
> column). As a result, the UDTF's empty or zeroed statistics silently 
> overwrite the base table's healthy statistics.
> *Proposed Fix:* Enforce strict parent operator boundaries before mapping 
> statistics. By slicing both the {{RowSchema}} and the {{columnExprMap}} into 
> isolated collections based on the {{SELECT_TAG}} and {{UDTF_TAG}} boundaries, 
> we create a firewall. {{StatsUtils}} will now only evaluate expressions that 
> mathematically belong to that specific branch, preventing the namespace 
> collision entirely.
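
A minimal sketch of the branch-scoping idea in the proposed fix, under stated assumptions: {{ColumnRef}}, {{sliceByParent}} and {{resolve}} are hypothetical names, entries are tagged with their parent explicitly (the real rule would derive the split from the {{SELECT_TAG}} and {{UDTF_TAG}} boundaries), and the NDV values are illustrative. It is not the actual patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BranchScopedStatsSketch {

    /** Hypothetical stand-in for one columnExprMap value: the parent it reads and the column in it. */
    static final class ColumnRef {
        final String parentTag;    // "SELECT" or "UDTF" (illustrative labels)
        final String parentColumn; // internal name inside that parent, e.g. "_col0"

        ColumnRef(String parentTag, String parentColumn) {
            this.parentTag = parentTag;
            this.parentColumn = parentColumn;
        }
    }

    /** Keeps only the entries whose expression reads from the given parent branch. */
    static Map<String, ColumnRef> sliceByParent(Map<String, ColumnRef> exprMap, String parentTag) {
        Map<String, ColumnRef> slice = new LinkedHashMap<>();
        for (Map.Entry<String, ColumnRef> entry : exprMap.entrySet()) {
            if (entry.getValue().parentTag.equals(parentTag)) {
                slice.put(entry.getKey(), entry.getValue());
            }
        }
        return slice;
    }

    /** Resolves an already-sliced map against the NDV stats of that same parent. */
    static Map<String, Long> resolve(Map<String, ColumnRef> slice, Map<String, Long> parentNdv) {
        Map<String, Long> resolved = new LinkedHashMap<>();
        for (Map.Entry<String, ColumnRef> entry : slice.entrySet()) {
            Long ndv = parentNdv.get(entry.getValue().parentColumn);
            if (ndv != null) {
                resolved.put(entry.getKey(), ndv);
            }
        }
        return resolved;
    }

    static Map<String, Long> demo() {
        Map<String, ColumnRef> exprMap = new LinkedHashMap<>();
        exprMap.put("_col0", new ColumnRef("SELECT", "_col0")); // id
        exprMap.put("_col1", new ColumnRef("UDTF", "_col0"));   // exploded column

        Map<String, Long> selectNdv = new LinkedHashMap<>();
        selectNdv.put("_col0", 2L); // illustrative NDV for id
        Map<String, Long> udtfNdv = new LinkedHashMap<>();
        udtfNdv.put("_col0", 6L);   // illustrative NDV for the exploded column

        // Each slice only ever sees the stats of its own branch.
        Map<String, Long> merged = new LinkedHashMap<>();
        merged.putAll(resolve(sliceByParent(exprMap, "SELECT"), selectNdv));
        merged.putAll(resolve(sliceByParent(exprMap, "UDTF"), udtfNdv));
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints {_col0=2, _col1=6}
    }
}
```

Because each slice is resolved only against its own parent's statistics, the shared internal name {{_col0}} no longer lets one branch's values leak into the other.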



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
