HyukjinKwon commented on code in PR #46683:
URL: https://github.com/apache/spark/pull/46683#discussion_r1609058630
##########
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala:
##########
@@ -3468,6 +3554,26 @@ class Dataset[T] private[sql] (
}
}
+ // Visible for testing
+ private[sql] var cachedRemoteRelationID: Option[String] = None
+
+ override def finalize(): Unit = {
Review Comment:
For Python there's actually no alternative, and `__del__` isn't deprecated in Python. I read up on why it's deprecated in the JDK (e.g., no guarantee on the order of `finalize` invocations, etc.), and that won't affect this specific use case, but yeah, let me switch it to `java.lang.ref.Cleaner`.
> This assumes that the original dataframe is always in scope, however if
that is being garbage collected any derived dataframe breaks. I think we should
pin it to the original CachedRemoteRelation instead, that will be re-used in
derived dataframes.
Yeah, this is a good point. Let me fix it up.
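For reference, a minimal sketch of the `java.lang.ref.Cleaner` pattern being proposed as the `finalize()` replacement. The class and field names here (`CachedRelationHandle`, `releasedIds`) are hypothetical stand-ins, not the actual Spark Connect types; the key constraint is real: the cleanup action must not capture a reference to the object being tracked, or it will never become unreachable.

```java
import java.lang.ref.Cleaner;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class CleanerSketch {
    // One shared Cleaner per application is the recommended usage.
    private static final Cleaner CLEANER = Cleaner.create();

    // Hypothetical stand-in for an object holding a server-side cached relation.
    static class CachedRelationHandle implements AutoCloseable {
        private final Cleaner.Cleanable cleanable;

        CachedRelationHandle(String relationId, Set<String> releasedIds) {
            // The lambda must only capture relationId/releasedIds,
            // never `this`, otherwise the handle stays strongly reachable.
            this.cleanable = CLEANER.register(this, () -> releasedIds.add(relationId));
        }

        @Override
        public void close() {
            // clean() is idempotent: the action runs at most once,
            // whether triggered here or later by the Cleaner thread on GC.
            cleanable.clean();
        }
    }

    public static void main(String[] args) {
        Set<String> released = ConcurrentHashMap.newKeySet();
        try (CachedRelationHandle h = new CachedRelationHandle("relation-1", released)) {
            // ... use the handle ...
        }
        System.out.println(released.contains("relation-1"));
    }
}
```

Unlike `finalize()`, explicit release via `close()` and GC-triggered release share a single idempotent code path, which also addresses the ordering concerns mentioned above.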
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]