This is an automated email from the ASF dual-hosted git repository.
gengliang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 8065ba705ce7 [SPARK-54653][DOCS] Add cross-session note to cache/persist public APIs
8065ba705ce7 is described below
commit 8065ba705ce71e6215efbee8c79a040e3dc95245
Author: Yan Yan <[email protected]>
AuthorDate: Wed Dec 10 13:36:09 2025 -0800
[SPARK-54653][DOCS] Add cross-session note to cache/persist public APIs
Document that cached data is shared across all Spark sessions within an
application for DataFrame/Dataset cache/persist methods and Catalog.cacheTable
methods.
### What changes were proposed in this pull request?
This change updates docs and comments to note that the DataFrame cache is
cross-session. Only docs for public, newer APIs are updated; e.g. `SQLContext`
could be in scope, but since it is a legacy API, no change was made to it.
### Why are the changes needed?
To further clarify the existing behavior of the cache API.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Rebuilt the entire project.
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #53401 from yyanyy/cache-doc-update.
Authored-by: Yan Yan <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
---
docs/sql-ref-syntax-aux-cache-cache-table.md | 2 ++
docs/sql-ref-syntax-aux-cache-refresh.md | 5 ++++-
docs/sql-ref-syntax-aux-cache-uncache-table.md | 2 ++
python/pyspark/sql/catalog.py | 14 ++++++++++++++
python/pyspark/sql/dataframe.py | 7 +++++++
sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala | 12 ++++++++++++
.../main/scala/org/apache/spark/sql/catalog/Catalog.scala | 10 ++++++++++
7 files changed, 51 insertions(+), 1 deletion(-)
diff --git a/docs/sql-ref-syntax-aux-cache-cache-table.md b/docs/sql-ref-syntax-aux-cache-cache-table.md
index 9a1e61abbabb..ae9e208e7f4e 100644
--- a/docs/sql-ref-syntax-aux-cache-cache-table.md
+++ b/docs/sql-ref-syntax-aux-cache-cache-table.md
@@ -24,6 +24,8 @@ license: |
`CACHE TABLE` statement caches contents of a table or output of a query with
the given storage level. If a query is cached, then a temp view will be created
for this query.
This reduces scanning of the original files in future queries.
+**Note:** Cached data is shared across all Spark sessions on the cluster.
+
### Syntax
```sql
diff --git a/docs/sql-ref-syntax-aux-cache-refresh.md b/docs/sql-ref-syntax-aux-cache-refresh.md
index 715bdcac3b6f..534b3cee9e4a 100644
--- a/docs/sql-ref-syntax-aux-cache-refresh.md
+++ b/docs/sql-ref-syntax-aux-cache-refresh.md
@@ -23,7 +23,10 @@ license: |
`REFRESH` is used to invalidate and refresh all the cached data (and the
associated metadata) for
all Datasets that contains the given data source path. Path matching is by
prefix, i.e. "/" would
-invalidate everything that is cached.
+invalidate everything that is cached.
+
+**Note:** Cached data is shared across all Spark sessions on the cluster, so refreshing it
+affects all sessions.
### Syntax
diff --git a/docs/sql-ref-syntax-aux-cache-uncache-table.md b/docs/sql-ref-syntax-aux-cache-uncache-table.md
index 4456378cdee1..b8ae8e3d4cef 100644
--- a/docs/sql-ref-syntax-aux-cache-uncache-table.md
+++ b/docs/sql-ref-syntax-aux-cache-uncache-table.md
@@ -24,6 +24,8 @@ license: |
`UNCACHE TABLE` removes the entries and associated data from the in-memory
and/or on-disk cache for a given table or view. The
underlying entries should already have been brought to cache by previous
`CACHE TABLE` operation. `UNCACHE TABLE` on a non-existent table throws an
exception if `IF EXISTS` is not specified.
+**Note:** Cached data is shared across all Spark sessions on the cluster, so uncaching it affects all sessions.
+
### Syntax
```sql
diff --git a/python/pyspark/sql/catalog.py b/python/pyspark/sql/catalog.py
index 40a0d9346ccc..a74acc145647 100644
--- a/python/pyspark/sql/catalog.py
+++ b/python/pyspark/sql/catalog.py
@@ -1019,6 +1019,10 @@ class Catalog:
.. versionchanged:: 3.5.0
Allow to specify storage level.
+ Notes
+ -----
+ Cached data is shared across all Spark sessions on the cluster.
+
Examples
--------
>>> _ = spark.sql("DROP TABLE IF EXISTS tbl1")
@@ -1061,6 +1065,11 @@ class Catalog:
.. versionchanged:: 3.4.0
Allow ``tableName`` to be qualified with catalog name.
+ Notes
+ -----
+ Cached data is shared across all Spark sessions on the cluster, so uncaching it
+ affects all sessions.
+
Examples
--------
>>> _ = spark.sql("DROP TABLE IF EXISTS tbl1")
@@ -1091,6 +1100,11 @@ class Catalog:
.. versionadded:: 2.0.0
+ Notes
+ -----
+ Cached data is shared across all Spark sessions on the cluster, so clearing
+ the cache affects all sessions.
+
Examples
--------
>>> _ = spark.sql("DROP TABLE IF EXISTS tbl1")
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index bf81c13a7bac..d6b11169f007 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -1515,6 +1515,8 @@ class DataFrame:
-----
The default storage level has changed to `MEMORY_AND_DISK_DESER` to
match Scala in 3.0.
+ Cached data is shared across all Spark sessions on the cluster.
+
Returns
-------
:class:`DataFrame`
@@ -1551,6 +1553,8 @@ class DataFrame:
-----
The default storage level has changed to `MEMORY_AND_DISK_DESER` to
match Scala in 3.0.
+ Cached data is shared across all Spark sessions on the cluster.
+
Parameters
----------
storageLevel : :class:`StorageLevel`
@@ -1621,6 +1625,9 @@ class DataFrame:
-----
`blocking` default has changed to ``False`` to match Scala in 2.0.
+ Cached data is shared across all Spark sessions on the cluster, so unpersisting it
+ affects all sessions.
+
Parameters
----------
blocking : bool
diff --git a/sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala
index eda20f6fae80..6b06ce58df6b 100644
--- a/sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -3018,6 +3018,8 @@ abstract class Dataset[T] extends Serializable {
/**
* Persist this Dataset with the default storage level (`MEMORY_AND_DISK`).
*
+ * @note
+ * Cached data is shared across all Spark sessions on the cluster.
* @group basic
* @since 1.6.0
*/
@@ -3026,6 +3028,8 @@ abstract class Dataset[T] extends Serializable {
/**
* Persist this Dataset with the default storage level (`MEMORY_AND_DISK`).
*
+ * @note
+ * Cached data is shared across all Spark sessions on the cluster.
* @group basic
* @since 1.6.0
*/
@@ -3037,6 +3041,8 @@ abstract class Dataset[T] extends Serializable {
* @param newLevel
* One of: `MEMORY_ONLY`, `MEMORY_AND_DISK`, `MEMORY_ONLY_SER`, `MEMORY_AND_DISK_SER`,
* `DISK_ONLY`, `MEMORY_ONLY_2`, `MEMORY_AND_DISK_2`, etc.
+ * @note
+ * Cached data is shared across all Spark sessions on the cluster.
* @group basic
* @since 1.6.0
*/
@@ -3056,6 +3062,9 @@ abstract class Dataset[T] extends Serializable {
*
* @param blocking
* Whether to block until all blocks are deleted.
+ * @note
+ * Cached data is shared across all Spark sessions on the cluster, so unpersisting it affects
+ * all sessions.
* @group basic
* @since 1.6.0
*/
@@ -3065,6 +3074,9 @@ abstract class Dataset[T] extends Serializable {
* Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk. This
* will not un-persist any cached data that is built upon this Dataset.
*
+ * @note
+ * Cached data is shared across all Spark sessions on the cluster, so unpersisting it affects
+ * all sessions.
* @group basic
* @since 1.6.0
*/
diff --git a/sql/api/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala b/sql/api/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala
index 57b77d27b126..0b4b50af20d4 100644
--- a/sql/api/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala
+++ b/sql/api/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala
@@ -593,6 +593,8 @@ abstract class Catalog {
* is either a qualified or unqualified name that designates a table/view. If no database
* identifier is provided, it refers to a temporary view or a table/view in the current
* database.
+ * @note
+ * Cached data is shared across all Spark sessions on the cluster.
* @since 2.0.0
*/
def cacheTable(tableName: String): Unit
@@ -606,6 +608,8 @@ abstract class Catalog {
* database.
* @param storageLevel
* storage level to cache table.
+ * @note
+ * Cached data is shared across all Spark sessions on the cluster.
* @since 2.3.0
*/
def cacheTable(tableName: String, storageLevel: StorageLevel): Unit
@@ -617,6 +621,9 @@ abstract class Catalog {
* is either a qualified or unqualified name that designates a table/view. If no database
* identifier is provided, it refers to a temporary view or a table/view in the current
* database.
+ * @note
+ * Cached data is shared across all Spark sessions on the cluster, so uncaching it affects all
+ * sessions.
* @since 2.0.0
*/
def uncacheTable(tableName: String): Unit
@@ -624,6 +631,9 @@ abstract class Catalog {
/**
* Removes all cached tables from the in-memory cache.
*
+ * @note
+ * Cached data is shared across all Spark sessions on the cluster, so clearing the cache
+ * affects all sessions.
* @since 2.0.0
*/
def clearCache(): Unit
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]