HyukjinKwon commented on code in PR #46683:
URL: https://github.com/apache/spark/pull/46683#discussion_r1609058630
##########
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala:
##########
@@ -3468,6 +3554,26 @@ class Dataset[T] private[sql] (
}
}
+ // Visible for testing
+ private[sql] var cachedRemoteRelationID: Option[String] = None
+
+ override def finalize(): Unit = {
Review Comment:
For Python there's actually no alternative, and `__del__` isn't deprecated in Python. I read up on why it's deprecated in the JDK (e.g., no guarantee on the order of `finalize` invocations, etc.), and that won't affect this specific use case, but yeah, let me switch it to `java.lang.ref.Cleaner`.
> This assumes that the original dataframe is always in scope, however if
that is being garbage collected any derived dataframe breaks. I think we should
pin it to the original CachedRemoteRelation instead, that will be re-used in
derived dataframes.
Yeah, this is a good point. Let me fix it up.
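For reference, a minimal sketch of the `java.lang.ref.Cleaner` pattern being proposed as the `finalize()` replacement. The class and field names here (`CachedRelationHandle`, `releasedIds`) are hypothetical stand-ins, not the actual Spark Connect types; the key constraint is real: the cleanup action must not capture a reference to the object being tracked, or it will never become unreachable.

```java
import java.lang.ref.Cleaner;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class CleanerSketch {
    // One shared Cleaner per application is the recommended usage.
    private static final Cleaner CLEANER = Cleaner.create();

    // Hypothetical stand-in for an object holding a server-side cached relation.
    static class CachedRelationHandle implements AutoCloseable {
        private final Cleaner.Cleanable cleanable;

        CachedRelationHandle(String relationId, Set<String> releasedIds) {
            // The lambda must only capture relationId/releasedIds,
            // never `this`, otherwise the handle stays strongly reachable.
            this.cleanable = CLEANER.register(this, () -> releasedIds.add(relationId));
        }

        @Override
        public void close() {
            // clean() is idempotent: the action runs at most once,
            // whether triggered here or later by the Cleaner thread on GC.
            cleanable.clean();
        }
    }

    public static void main(String[] args) {
        Set<String> released = ConcurrentHashMap.newKeySet();
        try (CachedRelationHandle h = new CachedRelationHandle("relation-1", released)) {
            // ... use the handle ...
        }
        System.out.println(released.contains("relation-1"));
    }
}
```

Unlike `finalize()`, explicit release via `close()` and GC-triggered release share a single idempotent code path, which also addresses the ordering concerns mentioned above.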
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]