This is an automated email from the ASF dual-hosted git repository.
gengliang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 8065ba705ce7 [SPARK-54653][DOCS] Add cross-session note to cache/persist public APIs
8065ba705ce7 is described below
commit 8065ba705ce71e6215efbee8c79a040e3dc95245
Author: Yan Yan <[email protected]>
AuthorDate: Wed Dec 10 13:36:09 2025 -0800
[SPARK-54653][DOCS] Add cross-session note to cache/persist public APIs
Document that cached data is shared across all Spark sessions within an
application for DataFrame/Dataset cache/persist methods and Catalog.cacheTable
methods.
### What changes were proposed in this pull request?
This change updates docs and comments to note that the DataFrame cache is
cross-session. Only docs for public, newer APIs are updated; e.g. `SQLContext`
could be in scope, but since it is a legacy API, no change was made to it.
### Why are the changes needed?
To further clarify the existing behavior of the cache API.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Rebuilt the entire project.
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #53401 from yyanyy/cache-doc-update.
Authored-by: Yan Yan <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
---
docs/sql-ref-syntax-aux-cache-cache-table.md | 2 ++
docs/sql-ref-syntax-aux-cache-refresh.md | 5 ++++-
docs/sql-ref-syntax-aux-cache-uncache-table.md | 2 ++
python/pyspark/sql/catalog.py | 14 ++++++++++++++
python/pyspark/sql/dataframe.py | 7 +++++++
sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala | 12 ++++++++++++
.../main/scala/org/apache/spark/sql/catalog/Catalog.scala | 10 ++++++++++
7 files changed, 51 insertions(+), 1 deletion(-)
diff --git a/docs/sql-ref-syntax-aux-cache-cache-table.md b/docs/sql-ref-syntax-aux-cache-cache-table.md
index 9a1e61abbabb..ae9e208e7f4e 100644
--- a/docs/sql-ref-syntax-aux-cache-cache-table.md
+++ b/docs/sql-ref-syntax-aux-cache-cache-table.md
@@ -24,6 +24,8 @@ license: |
`CACHE TABLE` statement caches contents of a table or output of a query with
the given storage level. If a query is cached, then a temp view will be created
for this query.
This reduces scanning of the original files in future queries.
+**Note:** Cached data is shared across all Spark sessions on the cluster.
+
### Syntax
```sql
diff --git a/docs/sql-ref-syntax-aux-cache-refresh.md b/docs/sql-ref-syntax-aux-cache-refresh.md
index 715bdcac3b6f..534b3cee9e4a 100644
--- a/docs/sql-ref-syntax-aux-cache-refresh.md
+++ b/docs/sql-ref-syntax-aux-cache-refresh.md
@@ -23,7 +23,10 @@ license: |
`REFRESH` is used to invalidate and refresh all the cached data (and the
associated metadata) for
all Datasets that contains the given data source path. Path matching is by
prefix, i.e. "/" would
-invalidate everything that is cached.
+invalidate everything that is cached.
+
+**Note:** Cached data is shared across all Spark sessions on the cluster, so refreshing it
+affects all sessions.
### Syntax
diff --git a/docs/sql-ref-syntax-aux-cache-uncache-table.md b/docs/sql-ref-syntax-aux-cache-uncache-table.md
index 4456378cdee1..b8ae8e3d4cef 100644
--- a/docs/sql-ref-syntax-aux-cache-uncache-table.md
+++ b/docs/sql-ref-syntax-aux-cache-uncache-table.md
@@ -24,6 +24,8 @@ license: |
`UNCACHE TABLE` removes the entries and associated data from the in-memory
and/or on-disk cache for a given table or view. The
underlying entries should already have been brought to cache by previous
`CACHE TABLE` operation. `UNCACHE TABLE` on a non-existent table throws an
exception if `IF EXISTS` is not specified.
+**Note:** Cached data is shared across all Spark sessions on the cluster, so uncaching it affects all sessions.
+
### Syntax
```sql
diff --git a/python/pyspark/sql/catalog.py b/python/pyspark/sql/catalog.py
index 40a0d9346ccc..a74acc145647 100644
--- a/python/pyspark/sql/catalog.py
+++ b/python/pyspark/sql/catalog.py
@@ -1019,6 +1019,10 @@ class Catalog:
.. versionchanged:: 3.5.0
Allow to specify storage level.
+ Notes
+ -----
+ Cached data is shared across all Spark sessions on the cluster.
+
Examples
--------
>>> _ = spark.sql("DROP TABLE IF EXISTS tbl1")
@@ -1061,6 +1065,11 @@ class Catalog:
.. versionchanged:: 3.4.0
Allow ``tableName`` to be qualified with catalog name.
+ Notes
+ -----
+ Cached data is shared across all Spark sessions on the cluster, so uncaching it
+ affects all sessions.
+
Examples
--------
>>> _ = spark.sql("DROP TABLE IF EXISTS tbl1")
@@ -1091,6 +1100,11 @@ class Catalog:
.. versionadded:: 2.0.0
+ Notes
+ -----
+ Cached data is shared across all Spark sessions on the cluster, so clearing
+ the cache affects all sessions.
+
Examples
--------
>>> _ = spark.sql("DROP TABLE IF EXISTS tbl1")
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index bf81c13a7bac..d6b11169f007 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -1515,6 +1515,8 @@ class DataFrame:
-----
The default storage level has changed to `MEMORY_AND_DISK_DESER` to
match Scala in 3.0.
+ Cached data is shared across all Spark sessions on the cluster.
+
Returns
-------
:class:`DataFrame`
@@ -1551,6 +1553,8 @@ class DataFrame:
-----
The default storage level has changed to `MEMORY_AND_DISK_DESER` to
match Scala in 3.0.
+ Cached data is shared across all Spark sessions on the cluster.
+
Parameters
----------
storageLevel : :class:`StorageLevel`
@@ -1621,6 +1625,9 @@ class DataFrame:
-----
`blocking` default has changed to ``False`` to match Scala in 2.0.
+ Cached data is shared across all Spark sessions on the cluster, so unpersisting it
+ affects all sessions.
+
Parameters
----------
blocking : bool
diff --git a/sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala
index eda20f6fae80..6b06ce58df6b 100644
--- a/sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -3018,6 +3018,8 @@ abstract class Dataset[T] extends Serializable {
/**
* Persist this Dataset with the default storage level (`MEMORY_AND_DISK`).
*
+ * @note
+ * Cached data is shared across all Spark sessions on the cluster.
* @group basic
* @since 1.6.0
*/
@@ -3026,6 +3028,8 @@ abstract class Dataset[T] extends Serializable {
/**
* Persist this Dataset with the default storage level (`MEMORY_AND_DISK`).
*
+ * @note
+ * Cached data is shared across all Spark sessions on the cluster.
* @group basic
* @since 1.6.0
*/
@@ -3037,6 +3041,8 @@ abstract class Dataset[T] extends Serializable {
* @param newLevel
* One of: `MEMORY_ONLY`, `MEMORY_AND_DISK`, `MEMORY_ONLY_SER`, `MEMORY_AND_DISK_SER`,
* `DISK_ONLY`, `MEMORY_ONLY_2`, `MEMORY_AND_DISK_2`, etc.
+ * @note
+ * Cached data is shared across all Spark sessions on the cluster.
* @group basic
* @since 1.6.0
*/
@@ -3056,6 +3062,9 @@ abstract class Dataset[T] extends Serializable {
*
* @param blocking
* Whether to block until all blocks are deleted.
+ * @note
+ * Cached data is shared across all Spark sessions on the cluster, so unpersisting it affects
+ * all sessions.
* @group basic
* @since 1.6.0
*/
@@ -3065,6 +3074,9 @@ abstract class Dataset[T] extends Serializable {
* Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk. This
* will not un-persist any cached data that is built upon this Dataset.
*
+ * @note
+ * Cached data is shared across all Spark sessions on the cluster, so unpersisting it affects
+ * all sessions.
* @group basic
* @since 1.6.0
*/
diff --git a/sql/api/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala b/sql/api/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala
index 57b77d27b126..0b4b50af20d4 100644
--- a/sql/api/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala
+++ b/sql/api/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala
@@ -593,6 +593,8 @@ abstract class Catalog {
* is either a qualified or unqualified name that designates a table/view. If no database
* identifier is provided, it refers to a temporary view or a table/view in the current
* database.
+ * @note
+ * Cached data is shared across all Spark sessions on the cluster.
* @since 2.0.0
*/
def cacheTable(tableName: String): Unit
@@ -606,6 +608,8 @@ abstract class Catalog {
* database.
* @param storageLevel
* storage level to cache table.
+ * @note
+ * Cached data is shared across all Spark sessions on the cluster.
* @since 2.3.0
*/
def cacheTable(tableName: String, storageLevel: StorageLevel): Unit
@@ -617,6 +621,9 @@ abstract class Catalog {
* is either a qualified or unqualified name that designates a table/view. If no database
* identifier is provided, it refers to a temporary view or a table/view in the current
* database.
+ * @note
+ * Cached data is shared across all Spark sessions on the cluster, so uncaching it affects all
+ * sessions.
* @since 2.0.0
*/
def uncacheTable(tableName: String): Unit
@@ -624,6 +631,9 @@ abstract class Catalog {
/**
* Removes all cached tables from the in-memory cache.
*
+ * @note
+ * Cached data is shared across all Spark sessions on the cluster, so clearing the cache
+ * affects all sessions.
* @since 2.0.0
*/
def clearCache(): Unit
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]