[PR] [SPARK-48258][PYTHON][CONNECT][FOLLOW-UP] Bind relation ID to the plan instead of DataFrame [spark]

via GitHub Tue, 21 May 2024 18:25:14 -0700


HyukjinKwon opened a new pull request, #46694:
URL: https://github.com/apache/spark/pull/46694


   ### What changes were proposed in this pull request?
   
   This PR addresses the review comment at
https://github.com/apache/spark/pull/46683#discussion_r1608527529
on the Python side, by binding the relation ID to the plan instead of the DataFrame itself.
   
   ### Why are the changes needed?
   
   Because the DataFrame holds the relation ID, if DataFrame B is derived from 
DataFrame A and DataFrame A is garbage-collected, the cache that DataFrame B 
depends on might no longer exist. See the example below:
   
   ```python
   df = spark.range(1).localCheckpoint()
   df2 = df.repartition(10)
   del df
   df2.collect()
   ```
   
   ```
   pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
(org.apache.spark.sql.connect.common.InvalidPlanInput) No DataFrame with id 
a4efa660-897c-4500-bd4e-bd57cd0263d2 is found in the session 
cd4764b4-90a9-4249-9140-12a6e4a98cd3
   ```
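
The lifetime issue can be illustrated without a Spark session. The following is only a minimal sketch under stated assumptions, not the actual PySpark implementation: `ServerCache`, `CachedPlan`, `DerivedPlan`, and `Frame` are hypothetical stand-ins, and `weakref.finalize` is used here purely to model "release the cached relation when its owner is collected". Because the release is tied to the plan node, and a derived plan keeps a strong reference to its child plan, deleting the original frame does not drop the cache entry.

```python
import gc
import weakref


class ServerCache:
    """Hypothetical stand-in for the server-side store of cached relations."""

    def __init__(self):
        self.entries = set()

    def add(self, relation_id):
        self.entries.add(relation_id)

    def remove(self, relation_id):
        self.entries.discard(relation_id)


class CachedPlan:
    """Hypothetical plan node that owns the relation ID."""

    def __init__(self, relation_id, cache):
        self.relation_id = relation_id
        cache.add(relation_id)
        # Release the cache entry only when this plan node is collected,
        # not when some frame wrapping it goes away.
        weakref.finalize(self, cache.remove, relation_id)


class DerivedPlan:
    """Hypothetical plan node built on top of another plan."""

    def __init__(self, child):
        self.child = child  # strong reference keeps the cached plan alive


class Frame:
    """Hypothetical DataFrame-like wrapper around a plan."""

    def __init__(self, plan):
        self.plan = plan

    def repartition(self):
        return Frame(DerivedPlan(self.plan))


cache = ServerCache()
df = Frame(CachedPlan("rel-1", cache))
df2 = df.repartition()

del df
gc.collect()
still_cached = "rel-1" in cache.entries  # df2's plan still references the cached plan

del df2
gc.collect()
released = "rel-1" not in cache.entries  # no plan references it any longer
```

If the finalizer were attached to `Frame` instead of `CachedPlan`, the entry would be removed at `del df`, which corresponds to the failure shown in the error above.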
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, the main change has not been released yet.
   
   ### How was this patch tested?
   
   Manually tested, and added a unit test.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

