[spark] branch branch-3.4 updated: [SPARK-40154][PYTHON][DOCS] Correct storage level in Dataframe.cache docstring

srowen Wed, 25 Oct 2023 05:37:48 -0700

This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/branch-3.4 by this push:
     new ecdb69f3db3 [SPARK-40154][PYTHON][DOCS] Correct storage level in Dataframe.cache docstring
ecdb69f3db3 is described below

commit ecdb69f3db3370aa7cf6ae8a52130379e465ca73
Author: Paul Staab <[email protected]>
AuthorDate: Wed Oct 25 07:36:15 2023 -0500

    [SPARK-40154][PYTHON][DOCS] Correct storage level in Dataframe.cache docstring
    
    ### What changes were proposed in this pull request?
    Corrects the docstring of `DataFrame.cache` to give the correct storage level after it changed in Spark 3.0. It seems that the docstring of `DataFrame.persist` was updated, but `cache` was forgotten.
    
    ### Why are the changes needed?
    The docstring claims that `cache` uses serialised storage, but it actually uses deserialised storage. I confirmed that this is still the case with Spark 3.5.0 using the example code from the Jira ticket.
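
The difference between the two storage levels comes down to a single flag. The sketch below is illustrative only and is not the example code from the Jira ticket: it uses a hypothetical stand-in tuple whose field names mirror `pyspark.StorageLevel`, so it runs without Spark, and the `describe` helper is an assumption made for this illustration.

```python
from collections import namedtuple

# Hypothetical stand-in that mirrors the field names of pyspark.StorageLevel,
# so this sketch runs without a Spark installation.
StorageLevel = namedtuple(
    "StorageLevel",
    ["useDisk", "useMemory", "useOffHeap", "deserialized", "replication"],
)

# Flag values assumed to match the PySpark constants of the same names.
MEMORY_AND_DISK = StorageLevel(True, True, False, False, 1)
MEMORY_AND_DISK_DESER = StorageLevel(True, True, False, True, 1)


def describe(level):
    """Summarise where a storage level keeps data and in which form."""
    candidates = (
        ("memory", level.useMemory),
        ("disk", level.useDisk),
        ("off-heap", level.useOffHeap),
    )
    places = [name for name, used in candidates if used]
    form = "deserialized" if level.deserialized else "serialized"
    return "+".join(places) + ", " + form


print(describe(MEMORY_AND_DISK))        # level named by the old docstring
print(describe(MEMORY_AND_DISK_DESER))  # level named by the corrected docstring
```

With PySpark installed, passing `df.cache().storageLevel` to `describe` should report the deserialized variant, since the real `StorageLevel` objects expose the same attribute names.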
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, the docstring changes.
    
    ### How was this patch tested?
    The GitHub Actions workflow succeeded.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes #43229 from paulstaab/SPARK-40154.
    
    Authored-by: Paul Staab <[email protected]>
    Signed-off-by: Sean Owen <[email protected]>
    (cherry picked from commit 94607dd001b133a25dc9865f25b3f9e7f5a5daa3)
    Signed-off-by: Sean Owen <[email protected]>
---
 python/pyspark/sql/dataframe.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 518bc9867d7..14426c51439 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -1404,7 +1404,7 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
         self.rdd.foreachPartition(f)  # type: ignore[arg-type]
 
     def cache(self) -> "DataFrame":
-        """Persists the :class:`DataFrame` with the default storage level (`MEMORY_AND_DISK`).
+        """Persists the :class:`DataFrame` with the default storage level (`MEMORY_AND_DISK_DESER`).
 
         .. versionadded:: 1.3.0
 
@@ -1413,7 +1413,7 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
 
         Notes
         -----
-        The default storage level has changed to `MEMORY_AND_DISK` to match Scala in 2.0.
+        The default storage level has changed to `MEMORY_AND_DISK_DESER` to match Scala in 3.0.
 
         Returns
         -------


