zabetak commented on code in PR #6014:
URL: https://github.com/apache/hive/pull/6014#discussion_r2315797079
##########
ql/src/test/queries/clientpositive/lateral_view_cbo_ppd_filter_loss.q:
##########
@@ -0,0 +1,33 @@
+-- SORT_QUERY_RESULTS
+-- HIVE-29084: LATERAL VIEW cartesian product schema construction
+
+-- Verifies PPD doesn't eliminate OR filter comparing base table vs lateral view columns
+SELECT t.key, t.value, lv.col
+FROM (SELECT '238' AS key, 'val_238' AS value) t
+LATERAL VIEW explode(array('238', '86', '311')) lv AS col
+WHERE t.key = '333' OR lv.col = '86'

Review Comment:
   The first part of the disjunction (`t.key = '333'`) is not verified. Given that the base table (`t`) does not contain the value `333`, we cannot tell just by looking at the query output whether that part of the filter was removed. Currently we are only testing `WHERE lv.col = '86'`; I'm not sure if that's the intention.

##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java:
##########
@@ -602,12 +603,23 @@ private QueryBlockInfo convertSource(RelNode r) throws CalciteSemanticException
       // retrieve the base table source.
       QueryBlockInfo tableFunctionSource = convertSource(tfs.getInput(0));
-      String sqAlias = tableFunctionSource.schema.get(0).table;
-      // the schema will contain the base table source fields
-      s = new Schema(tfs, sqAlias);
-      ast = createASTLateralView(tfs, s, tableFunctionSource, sqAlias);
+      // Create schema that preserves base table columns with original alias,
+      // but gives new UDTF columns the unique lateral view alias
+      int baseFieldCount = tableFunctionSource.schema.size();
+      List<RelDataTypeField> allOutputFields = tfs.getRowType().getFieldList();
+      final String sqAlias = tableFunctionSource.schema.get(0).table;
+      Stream<ColumnInfo> baseColumnsStream = allOutputFields.subList(0, baseFieldCount).stream()
+          .map(field -> new ColumnInfo(sqAlias, field.getName()));
+
+      final String lateralViewAlias = nextAlias();
+      Stream<ColumnInfo> udtfColumnsStream =
+          allOutputFields.subList(baseFieldCount, allOutputFields.size()).stream()
+              .map(field -> new ColumnInfo(lateralViewAlias, field.getName()));
+
+      s = new Schema(Stream.concat(baseColumnsStream, udtfColumnsStream).toList());
+      ast = createASTLateralView(tfs, s, tableFunctionSource, lateralViewAlias);

Review Comment:
   Concretely, I was thinking something like the following:
   ```suggestion
         final String lateralViewAlias = nextAlias();
         ast = createASTLateralView(tfs, tableFunctionSource, lateralViewAlias);
         s = new Schema(tfs, lateralViewAlias);
   ```
   assuming that the code inside `createASTLateralView` that needs a schema can use `tableFunctionSource.schema` directly. In this case, the "input" schema is the one inside `tableFunctionSource` and the "output" schema is the one constructed in `s`.

##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java:
##########
@@ -602,12 +604,26 @@ private QueryBlockInfo convertSource(RelNode r) throws CalciteSemanticException
       // retrieve the base table source.
       QueryBlockInfo tableFunctionSource = convertSource(tfs.getInput(0));
-      String sqAlias = tableFunctionSource.schema.get(0).table;
-      // the schema will contain the base table source fields
-      s = new Schema(tfs, sqAlias);
-      ast = createASTLateralView(tfs, s, tableFunctionSource, sqAlias);
+      // Create schema that preserves base table columns with original alias,
+      // but gives new UDTF columns the unique lateral view alias
+      int baseFieldCount = tableFunctionSource.schema.size();
+      List<RelDataTypeField> allOutputFields = tfs.getRowType().getFieldList();

Review Comment:
   From the syntax definition, a [LATERAL VIEW](https://hive.apache.org/docs/latest/language/languagemanual-lateralview/) is a virtual table with a user-defined table alias. Conceptually, every column in the output of the lateral view has the same table alias, so I would expect all columns in the same schema to have the same alias.

   For all conversions inside the `ASTConverter`, we should distinguish the input schema(s) from the output schema. Both are very important for correctly and unambiguously constructing the AST/SQL query. For the lateral view case, the input and output schema are somewhat mixed together, and maybe they shouldn't be. Some code inside the `createASTLateralView` method operates on the input schema and some on the output schema. In other words, up to a certain point in the code I think we could use the schema as-is from the input/source, and once we are done we could simply generate the output (new) schema using a new (generated) table alias. The idea is outlined in the comment below.

##########
ql/src/test/queries/clientpositive/lateral_view_cbo_ppd_filter_loss.q:
##########
@@ -0,0 +1,33 @@
+-- SORT_QUERY_RESULTS
+-- HIVE-29084: LATERAL VIEW cartesian product schema construction

Review Comment:
   nit: Remove completely or use the same summary as the JIRA ticket.
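To make the input/output schema separation described above more concrete, here is a self-contained toy sketch. It does not use Hive's real `Schema`/`ColumnInfo` classes (the names below are simplified stand-ins), but it illustrates the idea that, once the lateral view AST is built from the input schema, the output schema can be regenerated by placing every column under one freshly generated table alias:

```java
import java.util.List;
import java.util.stream.Collectors;

public class LateralViewSchemaSketch {

    // Minimal stand-in for ASTConverter's ColumnInfo: a (table alias, column name) pair.
    record ColumnInfo(String table, String column) {}

    // Re-alias every column of the converted source under a single lateral view alias.
    // This mirrors the point that all output columns of a lateral view conceptually
    // belong to the same virtual table, so they should share one table alias.
    static List<ColumnInfo> outputSchema(List<ColumnInfo> inputSchema, String lateralViewAlias) {
        return inputSchema.stream()
            .map(c -> new ColumnInfo(lateralViewAlias, c.column()))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Input: base table columns under alias "t", UDTF column under alias "lv".
        List<ColumnInfo> input = List.of(
            new ColumnInfo("t", "key"),
            new ColumnInfo("t", "value"),
            new ColumnInfo("lv", "col"));
        // Output: every column carries the same generated alias (here "lv_gen").
        System.out.println(outputSchema(input, "lv_gen"));
    }
}
```

With this shape, anything that needs the original aliases reads them from the input schema before re-aliasing, and everything downstream of the lateral view sees a single consistent virtual-table alias.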
##########
ql/src/test/queries/clientpositive/lateral_view_cbo_ppd_filter_loss.q:
##########
@@ -0,0 +1,33 @@
+-- SORT_QUERY_RESULTS
+-- HIVE-29084: LATERAL VIEW cartesian product schema construction
+
+-- Verifies PPD doesn't eliminate OR filter comparing base table vs lateral view columns
+SELECT t.key, t.value, lv.col
+FROM (SELECT '238' AS key, 'val_238' AS value) t

Review Comment:
   nit: It seems that the value column (`'val_238' AS value`) is somewhat redundant for this and all subsequent queries, so possibly the test cases can be simplified a bit further.

##########
ql/src/test/queries/clientpositive/lateral_view_cartesian_test.q:
##########
@@ -0,0 +1,172 @@
+-- HIVE-29084: LATERAL VIEW cartesian product schema construction
+DROP TABLE IF EXISTS src_lv_cart;
+CREATE TABLE src_lv_cart (val_array array<string>);
+INSERT INTO src_lv_cart VALUES (array('a', 'b'));
+
+SELECT first_val, second_val
+FROM src_lv_cart
+LATERAL VIEW explode(val_array) lv1 AS first_val
+LATERAL VIEW explode(val_array) lv2 AS second_val
+WHERE first_val != second_val
+ORDER BY first_val, second_val;

Review Comment:
   I guess the choice between `SORT_QUERY_RESULTS` and an explicit `ORDER BY` in the query is somewhat subjective. Both can avoid test flakiness, and each has its own advantages and disadvantages.

   Putting an `ORDER BY` in every query makes the tests more verbose and expands their scope. The plans will have more operators, `EXPLAIN` outputs will contain more info than strictly necessary, and potentially more rules will match/apply and affect the output plan. On the positive side, it is a native way to enforce sorted output and avoid potential test flakiness.

   `SORT_QUERY_RESULTS` applies to all queries inside the file and is a post-processing step completely independent of the query execution. Test inputs/outputs are less verbose, and flakiness does not interfere with the query execution and the actual testing scope.
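For illustration, a hypothetical simplified version of the PPD test that combines the points above: it drops the redundant `value` column, relies on `SORT_QUERY_RESULTS` alone (no `ORDER BY`), and adds a second base row with key `'333'` so that the first branch of the disjunction actually contributes rows to the output:

```sql
-- SORT_QUERY_RESULTS
-- Sketch only: the second base row makes the t.key = '333' branch observable.
SELECT t.key, lv.col
FROM (SELECT '238' AS key UNION ALL SELECT '333' AS key) t
LATERAL VIEW explode(array('238', '86', '311')) lv AS col
WHERE t.key = '333' OR lv.col = '86';
```

If the full filter is preserved, the `'333'` row should appear with all three array values while the `'238'` row appears only with `'86'`; if either disjunct were dropped by PPD, the result set would visibly change.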
   Personally, for this case I feel that `SORT_QUERY_RESULTS` is a better choice, but I don't feel strongly about it. I am OK with the `ORDER BY` approach if you prefer it. However, the test file currently contains both `SORT_QUERY_RESULTS` and `ORDER BY` clauses, so we should remove one of them. I leave the final choice to you.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org