This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
     new 0e1401dc71b [SPARK-43894][PYTHON] Fix bug in df.cache()
0e1401dc71b is described below

commit 0e1401dc71b5aee540a54fc6a36a1857b13390b4
Author: Martin Grund <[email protected]>
AuthorDate: Wed May 31 11:55:19 2023 -0400

    [SPARK-43894][PYTHON] Fix bug in df.cache()
    
    ### What changes were proposed in this pull request?
    Previously, calling `df.cache()` would raise an invalid plan input exception because `persist()` was not invoked with the right arguments. This patch simplifies the logic and makes it compatible with the behavior of Spark itself.
    
    ### Why are the changes needed?
    Bug
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added UT
    
    Closes #41399 from grundprinzip/df_cache.
    
    Authored-by: Martin Grund <[email protected]>
    Signed-off-by: Herman van Hovell <[email protected]>
    (cherry picked from commit d3f76c6ca07a7a11fd228dde770186c0fbc3f03f)
    Signed-off-by: Herman van Hovell <[email protected]>
---
 python/pyspark/sql/connect/dataframe.py                | 4 +---
 python/pyspark/sql/tests/connect/test_connect_basic.py | 6 ++++++
 2 files changed, 7 insertions(+), 3 deletions(-)
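For context, the shape of the fix can be sketched with a small stand-in class. This is only an illustration of the delegation pattern the patch introduces (`MockDataFrame` and the string storage level are hypothetical stand-ins, not the real pyspark types): `cache()` now simply delegates to `persist()` with its default storage level, instead of issuing a bare "persist" analyze request with missing arguments.

```python
# Illustrative sketch only -- not the real pyspark.sql.connect classes.
class MockDataFrame:
    def __init__(self):
        self.storage_level = None

    def persist(self, storage_level="MEMORY_AND_DISK"):
        # In the real client, persist() sends the storage level to the
        # server as part of the analyze request.
        self.storage_level = storage_level
        return self

    def cache(self):
        # After the fix: cache() is just persist() with its defaults,
        # matching classic PySpark's DataFrame.cache() behavior.
        return self.persist()


df = MockDataFrame()
assert df.cache() is df
assert df.storage_level == "MEMORY_AND_DISK"
```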

diff --git a/python/pyspark/sql/connect/dataframe.py b/python/pyspark/sql/connect/dataframe.py
index ca2e1b7a0dc..03049109061 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -1544,9 +1544,7 @@ class DataFrame:
     def cache(self) -> "DataFrame":
         if self._plan is None:
             raise Exception("Cannot cache on empty plan.")
-        relation = self._plan.plan(self._session.client)
-        self._session.client._analyze(method="persist", relation=relation)
-        return self
+        return self.persist()
 
     cache.__doc__ = PySparkDataFrame.cache.__doc__
 
diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py b/python/pyspark/sql/tests/connect/test_connect_basic.py
index 008b95d6f14..b051b9233c8 100644
--- a/python/pyspark/sql/tests/connect/test_connect_basic.py
+++ b/python/pyspark/sql/tests/connect/test_connect_basic.py
@@ -3032,6 +3032,12 @@ class SparkConnectBasicTests(SparkConnectSQLTestCase):
             message_parameters={"attr_name": "_jreader"},
         )
 
+    def test_df_cache(self):
+        df = self.connect.range(10)
+        df.cache()
+        self.assert_eq(10, df.count())
+        self.assertTrue(df.is_cached)
+
 
 class SparkConnectSessionTests(SparkConnectSQLTestCase):
     def _check_no_active_session_error(self, e: PySparkException):


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
