zabetak commented on code in PR #6014:
URL: https://github.com/apache/hive/pull/6014#discussion_r2315797079
##########
ql/src/test/queries/clientpositive/lateral_view_cbo_ppd_filter_loss.q:
##########
@@ -0,0 +1,33 @@
+-- SORT_QUERY_RESULTS
+-- HIVE-29084: LATERAL VIEW cartesian product schema construction
+
+-- Verifies PPD doesn't eliminate OR filter comparing base table vs lateral view columns
+SELECT t.key, t.value, lv.col
+FROM (SELECT '238' AS key, 'val_238' AS value) t
+LATERAL VIEW explode(array('238', '86', '311')) lv AS col
+WHERE t.key = '333' OR lv.col = '86'

Review Comment:
   The first part of the disjunction (`t.key = '333'`) is not verified. Given that the base table (`t`) does not contain the value `333`, we cannot tell just by looking at the query output whether that part of the filter was removed. Currently we are only testing `WHERE lv.col = '86'`; I'm not sure if that's the intention.

##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java:
##########
@@ -602,12 +603,23 @@ private QueryBlockInfo convertSource(RelNode r) throws CalciteSemanticException
       // retrieve the base table source.
       QueryBlockInfo tableFunctionSource = convertSource(tfs.getInput(0));
-      String sqAlias = tableFunctionSource.schema.get(0).table;
-      // the schema will contain the base table source fields
-      s = new Schema(tfs, sqAlias);
-      ast = createASTLateralView(tfs, s, tableFunctionSource, sqAlias);
+      // Create schema that preserves base table columns with original alias,
+      // but gives new UDTF columns the unique lateral view alias
+      int baseFieldCount = tableFunctionSource.schema.size();
+      List<RelDataTypeField> allOutputFields = tfs.getRowType().getFieldList();
+      final String sqAlias = tableFunctionSource.schema.get(0).table;
+      Stream<ColumnInfo> baseColumnsStream = allOutputFields.subList(0, baseFieldCount).stream()
+          .map(field -> new ColumnInfo(sqAlias, field.getName()));
+
+      final String lateralViewAlias = nextAlias();
+      Stream<ColumnInfo> udtfColumnsStream =
+          allOutputFields.subList(baseFieldCount, allOutputFields.size()).stream()
+              .map(field -> new ColumnInfo(lateralViewAlias, field.getName()));
+
+      s = new Schema(Stream.concat(baseColumnsStream, udtfColumnsStream).toList());
+      ast = createASTLateralView(tfs, s, tableFunctionSource, lateralViewAlias);

Review Comment:
   Concretely, I was thinking something like the following:
   ```suggestion
         final String lateralViewAlias = nextAlias();
         ast = createASTLateralView(tfs, tableFunctionSource, lateralViewAlias);
         s = new Schema(tfs, lateralViewAlias);
   ```
   assuming that the code inside `createASTLateralView` that needs a schema can use `tableFunctionSource.schema` directly. In this case, the "input" schema is the one inside `tableFunctionSource` and the "output" schema is the one constructed in `s`.

##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java:
##########
@@ -602,12 +604,26 @@ private QueryBlockInfo convertSource(RelNode r) throws CalciteSemanticException
       // retrieve the base table source.
       QueryBlockInfo tableFunctionSource = convertSource(tfs.getInput(0));
-      String sqAlias = tableFunctionSource.schema.get(0).table;
-      // the schema will contain the base table source fields
-      s = new Schema(tfs, sqAlias);
-      ast = createASTLateralView(tfs, s, tableFunctionSource, sqAlias);
+      // Create schema that preserves base table columns with original alias,
+      // but gives new UDTF columns the unique lateral view alias
+      int baseFieldCount = tableFunctionSource.schema.size();
+      List<RelDataTypeField> allOutputFields = tfs.getRowType().getFieldList();

Review Comment:
   From the syntax definition, a [LATERAL VIEW](https://hive.apache.org/docs/latest/language/languagemanual-lateralview/) is a virtual table with a user-defined table alias. Conceptually, every column in the output of the lateral view has the same table alias, so I would expect all columns in the same schema to have the same alias.

   For all conversions inside the `ASTConverter`, we should distinguish the input schema(s) from the output schema. Both are very important for correctly and unambiguously constructing the AST/SQL query. For the lateral view case, the input and output schema are somewhat mixed together, and maybe they shouldn't be. Some code inside the `createASTLateralView` method operates on the input schema and some on the output schema. In other words, up to a certain point in the code I think we could use the schema as-is from the input/source, and once we are done we could simply generate the output (new) schema using a new (generated) table alias. The idea is outlined in the comment below.

##########
ql/src/test/queries/clientpositive/lateral_view_cbo_ppd_filter_loss.q:
##########
@@ -0,0 +1,33 @@
+-- SORT_QUERY_RESULTS
+-- HIVE-29084: LATERAL VIEW cartesian product schema construction

Review Comment:
   nit: Remove completely or use the same summary as the JIRA ticket.
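To make the input/output schema separation described above more concrete, here is a self-contained toy sketch. It does not use Hive's real `Schema`/`ColumnInfo` classes (the names below are simplified stand-ins), but it illustrates the idea that, once the lateral view AST is built from the input schema, the output schema can be regenerated by placing every column under one freshly generated table alias:

```java
import java.util.List;
import java.util.stream.Collectors;

public class LateralViewSchemaSketch {

    // Minimal stand-in for ASTConverter's ColumnInfo: a (table alias, column name) pair.
    record ColumnInfo(String table, String column) {}

    // Re-alias every column of the converted source under a single lateral view alias.
    // This mirrors the point that all output columns of a lateral view conceptually
    // belong to the same virtual table, so they should share one table alias.
    static List<ColumnInfo> outputSchema(List<ColumnInfo> inputSchema, String lateralViewAlias) {
        return inputSchema.stream()
            .map(c -> new ColumnInfo(lateralViewAlias, c.column()))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Input: base table columns under alias "t", UDTF column under alias "lv".
        List<ColumnInfo> input = List.of(
            new ColumnInfo("t", "key"),
            new ColumnInfo("t", "value"),
            new ColumnInfo("lv", "col"));
        // Output: every column carries the same generated alias (here "lv_gen").
        System.out.println(outputSchema(input, "lv_gen"));
    }
}
```

With this shape, anything that needs the original aliases reads them from the input schema before re-aliasing, and everything downstream of the lateral view sees a single consistent virtual-table alias.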
##########
ql/src/test/queries/clientpositive/lateral_view_cbo_ppd_filter_loss.q:
##########
@@ -0,0 +1,33 @@
+-- SORT_QUERY_RESULTS
+-- HIVE-29084: LATERAL VIEW cartesian product schema construction
+
+-- Verifies PPD doesn't eliminate OR filter comparing base table vs lateral view columns
+SELECT t.key, t.value, lv.col
+FROM (SELECT '238' AS key, 'val_238' AS value) t

Review Comment:
   nit: It seems that the value column (`'val_238' AS value`) is somewhat redundant for this and all subsequent queries, so possibly the test cases can be simplified a bit further.

##########
ql/src/test/queries/clientpositive/lateral_view_cartesian_test.q:
##########
@@ -0,0 +1,172 @@
+-- HIVE-29084: LATERAL VIEW cartesian product schema construction
+DROP TABLE IF EXISTS src_lv_cart;
+CREATE TABLE src_lv_cart (val_array array<string>);
+INSERT INTO src_lv_cart VALUES (array('a', 'b'));
+
+SELECT first_val, second_val
+FROM src_lv_cart
+LATERAL VIEW explode(val_array) lv1 AS first_val
+LATERAL VIEW explode(val_array) lv2 AS second_val
+WHERE first_val != second_val
+ORDER BY first_val, second_val;

Review Comment:
   I guess the choice between `SORT_QUERY_RESULTS` and an explicit `ORDER BY` in the query is somewhat subjective. Both can avoid test flakiness, and each has its own advantages and disadvantages.

   Putting an `ORDER BY` in every query makes the tests more verbose and expands their scope. The plans will have more operators, `EXPLAIN` outputs will contain more info than strictly necessary, and potentially more rules will match/apply and affect the output plan. On the positive side, it is a native way to enforce sorted output and avoid potential test flakiness.

   `SORT_QUERY_RESULTS` applies to all queries inside the file and is a post-processing step completely independent of the query execution. Test inputs/outputs are less verbose, and flakiness does not interfere with the query execution and the actual testing scope.
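For illustration, a hypothetical simplified version of the PPD test that combines the points above: it drops the redundant `value` column, relies on `SORT_QUERY_RESULTS` alone (no `ORDER BY`), and adds a second base row with key `'333'` so that the first branch of the disjunction actually contributes rows to the output:

```sql
-- SORT_QUERY_RESULTS
-- Sketch only: the second base row makes the t.key = '333' branch observable.
SELECT t.key, lv.col
FROM (SELECT '238' AS key UNION ALL SELECT '333' AS key) t
LATERAL VIEW explode(array('238', '86', '311')) lv AS col
WHERE t.key = '333' OR lv.col = '86';
```

If the full filter is preserved, the `'333'` row should appear with all three array values while the `'238'` row appears only with `'86'`; if either disjunct were dropped by PPD, the result set would visibly change.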
   Personally, for this case I feel that `SORT_QUERY_RESULTS` is a better choice, but I don't feel strongly about it. I am OK with the `ORDER BY` approach if you prefer it. However, the test file currently contains both `SORT_QUERY_RESULTS` and `ORDER BY` clauses, so we should remove one of them. I leave the final choice to you.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org