zhengruifeng commented on PR #42040: URL: https://github.com/apache/spark/pull/42040#issuecomment-1637817969
In https://github.com/apache/spark/pull/39925, we introduced a new mechanism to resolve an expression against a specified plan. However, the plan ID is sometimes eliminated by the analyzer, and then some expressions cannot be resolved correctly. This issue is the No. 1 blocker for PS on Connect.

So far, I have investigated the two examples [in the ticket](https://issues.apache.org/jira/browse/SPARK-43611) and checked each rule applied to them.

example 1:
```
>>> import pyspark.pandas as ps
>>> psdf1 = ps.DataFrame({"A": [1, 2, 3]})
>>> psdf2 = ps.DataFrame({"B": [1, 2, 3]})
>>> psdf1.append(psdf2)
```

example 2:
```
import pyspark.pandas as ps
import pandas as pd

pdf = pd.DataFrame(
    {
        "A": [None, 3, None, None],
        "B": [2, 4, None, 3],
        "C": [None, None, None, 1],
        "D": [0, 1, 5, 4],
    },
    columns=["A", "B", "C", "D"],
)
psdf = ps.from_pandas(pdf)
psdf.backfill()
```

In the draft, I modified two rules to retain the plan ID. (Actually, I modified [ResolveNaturalAndUsingJoin](https://github.com/apache/spark/blob/6161bf44f40f8146ea4c115c788fd4eaeb128769/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L3302-L3316) in https://github.com/apache/spark/commit/167bbca49c1c12ccd349d4330862c136b38d4522.)

I am wondering whether there is a more graceful approach to fix this issue. Otherwise, I'm afraid I will have to touch more rules.

cc @cloud-fan @HyukjinKwon @itholic
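To make the failure mode concrete, here is a toy sketch in plain Python. It is not Spark code: `Node`, `PLAN_ID`, `find_by_plan_id`, and both rewrite rules are hypothetical names invented for illustration, and the real analyzer is far more involved. It only shows how a rule that rebuilds a node without carrying over its tags makes a later lookup by plan ID fail, and how copying the tags onto the replacement node avoids that.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

PLAN_ID = "plan_id"  # hypothetical tag key standing in for the plan ID tag


@dataclass
class Node:
    """Toy plan node: a name, child nodes, and a side table of tags."""
    name: str
    children: Tuple["Node", ...] = ()
    tags: Dict[str, int] = field(default_factory=dict)


def find_by_plan_id(plan: Node, plan_id: int) -> Optional[Node]:
    """Depth-first search for the node tagged with the given plan ID."""
    if plan.tags.get(PLAN_ID) == plan_id:
        return plan
    for child in plan.children:
        found = find_by_plan_id(child, plan_id)
        if found is not None:
            return found
    return None


def rewrite_join_dropping_tags(plan: Node) -> Node:
    """Rewrite 'UsingJoin' into Project(Join(...)) and forget its tags."""
    children = tuple(rewrite_join_dropping_tags(c) for c in plan.children)
    if plan.name == "UsingJoin":
        return Node("Project", (Node("Join", children),))
    return Node(plan.name, children, dict(plan.tags))


def rewrite_join_keeping_tags(plan: Node) -> Node:
    """Same rewrite, but the replacement root inherits the original tags."""
    children = tuple(rewrite_join_keeping_tags(c) for c in plan.children)
    if plan.name == "UsingJoin":
        return Node("Project", (Node("Join", children),), dict(plan.tags))
    return Node(plan.name, children, dict(plan.tags))


# A plan whose join node carries plan ID 7.
plan = Node(
    "Filter",
    (Node("UsingJoin", (Node("Relation"), Node("Relation")), {PLAN_ID: 7}),),
)

lost = find_by_plan_id(rewrite_join_dropping_tags(plan), 7)
kept = find_by_plan_id(rewrite_join_keeping_tags(plan), 7)
```

In this toy model `lost` is `None` while `kept` is the rewritten `Project` node, which is why every rule that replaces a tagged node would need the same treatment unless the tags are preserved in some more general way.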
