[ https://issues.apache.org/jira/browse/HIVE-29084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18007343#comment-18007343 ]
Konstantin Bereznyakov commented on HIVE-29084: ----------------------------------------------- uzing Tez, EXPLAN EXTENDED with CBO ON/OFF: {code} SET hive.cbo.enable=true; No rows affected (0.007 seconds) 0: jdbc:hive2://localhost:10000> SET hive.cbo.enable; +-----------------------+ | set | +-----------------------+ | hive.cbo.enable=true | +-----------------------+ 1 row selected (0.01 seconds) 0: jdbc:hive2://localhost:10000> EXPLAIN EXTENDED SELECT first_val, second_val . . . . . . . . . . . . . . . .> FROM (SELECT array('a', 'b') as val_array) inline_data . . . . . . . . . . . . . . . .> LATERAL VIEW explode(val_array) lv1 AS first_val . . . . . . . . . . . . . . . .> LATERAL VIEW explode(val_array) lv2 AS second_val . . . . . . . . . . . . . . . .> WHERE first_val != second_val; INFO : Compiling command(queryId=root_20250715220200_00940b48-adba-4296-ae50-75b5e92a0c7b): EXPLAIN EXTENDED SELECT first_val, second_val FROM (SELECT array('a', 'b') as val_array) inline_data LATERAL VIEW explode(val_array) lv1 AS first_val LATERAL VIEW explode(val_array) lv2 AS second_val WHERE first_val != second_val INFO : Semantic Analysis Completed (retrial = false) INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:Explain, type:string, comment:null)], properties:null) INFO : Completed compiling command(queryId=root_20250715220200_00940b48-adba-4296-ae50-75b5e92a0c7b); Time taken: 0.216 seconds INFO : Concurrency mode is disabled, not creating a lock manager INFO : Executing command(queryId=root_20250715220200_00940b48-adba-4296-ae50-75b5e92a0c7b): EXPLAIN EXTENDED SELECT first_val, second_val FROM (SELECT array('a', 'b') as val_array) inline_data LATERAL VIEW explode(val_array) lv1 AS first_val LATERAL VIEW explode(val_array) lv2 AS second_val WHERE first_val != second_val INFO : Starting task [Stage-1:EXPLAIN] in serial mode INFO : Completed executing command(queryId=root_20250715220200_00940b48-adba-4296-ae50-75b5e92a0c7b); Time taken: 0.031 seconds +----------------------------------------------------+ | Explain | +----------------------------------------------------+ | STAGE DEPENDENCIES: | | Stage-0 is a root stage | | | | STAGE PLANS: | | Stage: Stage-0 | | Fetch Operator | | limit: -1 | | Processor Tree: | | TableScan | | alias: _dummy_table | | Row Limit Per Split: 1 | | GatherStats: false | | Select Operator | | expressions: array('a','b') (type: array<string>) | | outputColumnNames: _col0 | | Lateral View Forward | | Select Operator | | expressions: _col0 (type: array<string>) | | outputColumnNames: _col0 | | Lateral View Join Operator | | outputColumnNames: _col0, _col1 | | Lateral View Forward | | Select Operator | | expressions: _col1 (type: string) | | outputColumnNames: _col1 | | Lateral View Join Operator | | outputColumnNames: _col1, _col2 | | Select Operator | | expressions: _col1 (type: string), _col2 (type: string) | | outputColumnNames: _col0, _col1 | | ListSink | | Select Operator | | expressions: _col0 (type: array<string>) | | outputColumnNames: _col0 | | UDTF Operator | | function name: explode | | Lateral View Join Operator | | outputColumnNames: _col1, _col2 | | Select Operator | | expressions: _col1 (type: string), _col2 (type: string) | | outputColumnNames: _col0, _col1 | | ListSink | | Select Operator | | expressions: Const array<string> [a, b] (type: array<string>) | | outputColumnNames: _col0 | | UDTF Operator | | function name: explode | | Lateral View Join Operator | | outputColumnNames: _col0, _col1 | | Lateral View Forward | | Select Operator | | expressions: _col1 (type: string) | | outputColumnNames: _col1 | | Lateral View Join Operator | | outputColumnNames: _col1, _col2 | | Select Operator | | expressions: _col1 (type: string), _col2 (type: string) | | outputColumnNames: _col0, _col1 | | ListSink | | Select Operator | | expressions: _col0 (type: array<string>) | | outputColumnNames: _col0 | | UDTF Operator | | function name: explode | | Lateral View Join Operator | | outputColumnNames: _col1, _col2 | | Select Operator | | expressions: _col1 (type: string), _col2 (type: string) | | outputColumnNames: _col0, _col1 | | ListSink | | | +----------------------------------------------------+ 71 rows selected (0.265 seconds) 0: jdbc:hive2://localhost:10000> SET hive.cbo.enable=false; No rows affected (0.005 seconds) 0: jdbc:hive2://localhost:10000> SET hive.cbo.enable; +------------------------+ | set | +------------------------+ | hive.cbo.enable=false | +------------------------+ 1 row selected (0.007 seconds) 0: jdbc:hive2://localhost:10000> EXPLAIN EXTENDED SELECT first_val, second_val . . . . . . . . . . . . . . . .> FROM (SELECT array('a', 'b') as val_array) inline_data . . . . . . . . . . . . . . . .> LATERAL VIEW explode(val_array) lv1 AS first_val . . . . . . . . . . . . . . . .> LATERAL VIEW explode(val_array) lv2 AS second_val . . . . . . . . . . . . . . . .> WHERE first_val != second_val; INFO : Compiling command(queryId=root_20250715220220_a4c8999e-6aa2-4546-8aa0-d5c5128a5d80): EXPLAIN EXTENDED SELECT first_val, second_val FROM (SELECT array('a', 'b') as val_array) inline_data LATERAL VIEW explode(val_array) lv1 AS first_val LATERAL VIEW explode(val_array) lv2 AS second_val WHERE first_val != second_val INFO : Semantic Analysis Completed (retrial = false) INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:Explain, type:string, comment:null)], properties:null) INFO : Completed compiling command(queryId=root_20250715220220_a4c8999e-6aa2-4546-8aa0-d5c5128a5d80); Time taken: 0.095 seconds INFO : Concurrency mode is disabled, not creating a lock manager INFO : Executing command(queryId=root_20250715220220_a4c8999e-6aa2-4546-8aa0-d5c5128a5d80): EXPLAIN EXTENDED SELECT first_val, second_val FROM (SELECT array('a', 'b') as val_array) inline_data LATERAL VIEW explode(val_array) lv1 AS first_val LATERAL VIEW explode(val_array) lv2 AS second_val WHERE first_val != second_val INFO : Starting task [Stage-1:EXPLAIN] in serial mode INFO : Completed executing command(queryId=root_20250715220220_a4c8999e-6aa2-4546-8aa0-d5c5128a5d80); Time taken: 0.034 seconds +----------------------------------------------------+ | Explain | +----------------------------------------------------+ | STAGE DEPENDENCIES: | | Stage-0 is a root stage | | | | STAGE PLANS: | | Stage: Stage-0 | | Fetch Operator | | limit: -1 | | Processor Tree: | | TableScan | | alias: _dummy_table | | Row Limit Per Split: 1 | | GatherStats: false | | Select Operator | | expressions: array('a','b') (type: array<string>) | | outputColumnNames: _col0 | | Lateral View Forward | | Select Operator | | expressions: _col0 (type: array<string>) | | outputColumnNames: _col0 | | Lateral View Join Operator | | outputColumnNames: _col0, _col1 | | Lateral View Forward | | Select Operator | | expressions: _col1 (type: string) | | outputColumnNames: _col1 | | Lateral View Join Operator | | outputColumnNames: _col1, _col2 | | Filter Operator | | isSamplingPred: false | | predicate: (_col1 <> _col2) (type: boolean) | | Select Operator | | expressions: _col1 (type: string), _col2 (type: string) | | outputColumnNames: _col0, _col1 | | ListSink | | Select Operator | | expressions: _col0 (type: array<string>) | | outputColumnNames: _col0 | | UDTF Operator | | function name: explode | | Lateral View Join Operator | | outputColumnNames: _col1, _col2 | | Filter Operator | | isSamplingPred: false | | predicate: (_col1 <> _col2) (type: boolean) | | Select Operator | | expressions: _col1 (type: string), _col2 (type: string) | | outputColumnNames: _col0, _col1 | | ListSink | | Select Operator | | expressions: Const array<string> [a, b] (type: array<string>) | | outputColumnNames: _col0 | | UDTF Operator | | function name: explode | | Lateral View Join Operator | | outputColumnNames: _col0, _col1 | | Lateral View Forward | | Select Operator | | expressions: _col1 (type: string) | | outputColumnNames: _col1 | | Lateral View Join Operator | | outputColumnNames: _col1, _col2 | | Filter Operator | | isSamplingPred: false | | predicate: (_col1 <> _col2) (type: boolean) | | Select Operator | | expressions: _col1 (type: string), _col2 (type: string) | | outputColumnNames: _col0, _col1 | | ListSink | | Select Operator | | expressions: _col0 (type: array<string>) | | outputColumnNames: _col0 | | UDTF Operator | | function name: explode | | Lateral View Join Operator | | outputColumnNames: _col1, _col2 | | Filter Operator | | isSamplingPred: false | | predicate: (_col1 <> _col2) (type: boolean) | | Select Operator | | expressions: _col1 (type: string), _col2 (type: string) | | outputColumnNames: _col0, _col1 | | ListSink | | | +----------------------------------------------------+ 83 rows selected (0.145 seconds){code} > CBO returns incorrect results when WHERE clause filters multiple LATERAL VIEW > explode cross-products > ---------------------------------------------------------------------------------------------------- > > Key: HIVE-29084 > URL: https://issues.apache.org/jira/browse/HIVE-29084 > Project: Hive > Issue Type: Bug > Components: CBO > Affects Versions: 4.1.0, 4.0.1 > Environment: - Affected Versions: Hive 4.0.1, 4.2.0-SNAPSHOT > (likely all CBO+lateral view versions) > - Execution Engine: MapReduce (local mode) > - Metastore: Derby embedded > Reporter: Konstantin Bereznyakov > Assignee: Konstantin Bereznyakov > Priority: Major > Attachments: hive_cbo_lateral_view_where_bug.log, > hive_cbo_lateral_view_where_bug.sql > > > When CBO is enabled, queries with multiple LATERAL VIEW explode operations > and WHERE clause filtering return incorrect results. The WHERE clause > selectivity is not properly applied during optimization, causing the query to > return rows that should be filtered out. > I can reliably reproduce the bug with 2+ occurrences of "LATERAL VIEW > explode()" *and* a WHERE clause; it does not seem to happen with only one > "LATERAL VIEW explode()" or without a WHERE clause. > Originally reproduced with published 4.0.1 container, then verified with > locally built current master branch. > I was *not* able to reproduce it with unit tests. > {{*Minimal Reproduction Case:*}} > {{ }} > {{ DROP TABLE IF EXISTS cbo_lateral_view_where_bug;}} > {{ CREATE TABLE cbo_lateral_view_where_bug (values array<string>);}} > {{ INSERT INTO cbo_lateral_view_where_bug SELECT array('a', 'b');}} > {{ ANALYZE TABLE cbo_lateral_view_where_bug COMPUTE STATISTICS;}}{{ ** }} > {{ *-- Correct result (CBO OFF)*}} > {{ SET hive.cbo.enable=false;}} > {{ SELECT first_val, second_val}} > {{ FROM cbo_lateral_view_where_bug}} > {{ LATERAL VIEW explode(values) lv1 AS first_val}} > {{ LATERAL VIEW explode(values) lv2 AS second_val}} > {{ WHERE first_val != second_val;}}{{ *-- Incorrect result (CBO ON)*}} > {{ SET hive.cbo.enable=true;}} > {{ SELECT first_val, second_val}} > {{ FROM cbo_lateral_view_where_bug}} > {{ LATERAL VIEW explode(values) lv1 AS first_val}} > {{ LATERAL VIEW explode(values) lv2 AS second_val}} > {{ WHERE first_val != second_val;}}{{ ** }} > {{ *–- Expected vs Actual Results:*}} > {{ - CBO OFF (correct): 2 rows: ('a','b') and ('b','a')}} > {{ - CBO ON (incorrect): 4 rows: ('a','b'), ('b','a'), ('a','a'), ('b','b')}} > *Root Cause:* > CBO fails to properly estimate the selectivity of WHERE clauses that filter > cross-products created by multiple lateral views on the same array. The > optimizer appears to ignore or miscalculate the filtering effect, allowing > rows that should be excluded. > *Impact:* > - Data Correctness: Queries return wrong results, not just performance > issues > - Scope: Requires 2+ LATERAL VIEW explode operations with WHERE filtering > - Severity: Silent data corruption - users may not notice incorrect results > {*}Workaround{*}: > Set {*}hive.cbo.enable{*}=false for queries with multiple lateral view > cross-products and WHERE clause filtering. > > Attachments: > - [^hive_cbo_lateral_view_where_bug.sql] - SQL to reproduce > - Hive log: [^hive_cbo_lateral_view_where_bug.log] -- This message was sent by Atlassian Jira (v8.20.10#820010)