zhengruifeng commented on code in PR #56398:
URL: https://github.com/apache/spark/pull/56398#discussion_r3400651185
##########
python/pyspark/sql/tests/connect/test_parity_column.py:
##########
@@ -50,6 +50,63 @@ def test_resolve_after_union(self):
with self.assertRaisesRegex(AnalysisException,
"CANNOT_RESOLVE_DATAFRAME_COLUMN"):
df1.union(df2).select(df1.c).collect()
+ # zip merges the two column-projected sides into a single plan, so the
+ # per-DataFrame plan-id tags do not survive ResolveZip. A tagged left/right
+ # reference can no longer be found and raises in both strict and lenient
Review Comment:
Ran it. Your trace holds exactly when the base's plan root is the relation
node itself, which is not the case for `createDataFrame`:
- `range` base: the root is the bare `Range` node, which `ResolveZip` reuses
unchanged as the merged base, so its plan-id tag survives and
`r.zip(rr).select(r.id)` resolves on Connect in both strict and lenient modes -
as you predicted.
- `createDataFrame` base: it analyzes to `Project [a AS a, b AS b]` over a
`LocalRelation`, and the plan-id tag sits on that `Project`. `analyzeChain`
dissolves that `Project` into the merged chain, so the tag is lost and
`df.zip(right).select(df.a)` raises `CANNOT_RESOLVE_DATAFRAME_COLUMN` in both
strict and lenient modes, just like references to the projected sides.
Added both shapes as tests in ee26643956f:
`test_resolve_after_zip_base_side` (createDataFrame; parity override asserts
the raise) and `test_resolve_after_zip_bare_base_side` (range; resolves
everywhere, inherited with no override). Also noted the boundary in the PR
description.
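
For anyone following along, here is a toy model of the boundary, not Spark
code. `Node`, `merge_base`, and `has_plan_id` are hypothetical names made up
for this sketch, and the merge is reduced to one assumed rule: a `Project`
root is dissolved and its tag dropped, while any other root is reused
unchanged.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical plan node; plan_id stands in for the Connect plan-id tag."""
    name: str
    children: Tuple["Node", ...] = ()
    plan_id: Optional[int] = None


def merge_base(base: Node) -> Node:
    """Toy merge (assumed rule): dissolve a Project root, reuse any other root."""
    inner = base.children[0] if base.name == "Project" else base
    # The merged projection is a fresh node, so it carries no tag of its own.
    return Node("Project", (inner,))


def has_plan_id(plan: Node, plan_id: int) -> bool:
    """Return True if any node in the plan still carries the given tag."""
    if plan.plan_id == plan_id:
        return True
    return any(has_plan_id(child, plan_id) for child in plan.children)


# `range` shape: the tag sits on the bare relation node.
range_base = Node("Range", plan_id=1)
# `createDataFrame` shape: the tag sits on a Project above the relation.
local_base = Node("Project", (Node("LocalRelation"),), plan_id=2)

print(has_plan_id(merge_base(range_base), 1))  # True: root reused, tag survives
print(has_plan_id(merge_base(local_base), 2))  # False: Project dissolved, tag lost
```

The two prints mirror the two new tests: the bare-root case still finds its
tag, and the `Project`-rooted case does not.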
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]