Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/16609
If this is for display, I like the change. However, I am afraid it might
introduce some side effects.
In this PR, the changes I am most concerned about are the following two
functions:
```Scala
def persist(): this.type = {
  sparkSession.sharedState.cacheManager.cacheQuery(this, Option(name))
  this
}

def persist(newLevel: StorageLevel): this.type = {
  sparkSession.sharedState.cacheManager.cacheQuery(this, Option(name), newLevel)
  this
}
```
Before jumping to the new changes, let us see how we are using `cacheQuery`.
```Scala
def cacheQuery(
    query: Dataset[_],
    tableName: Option[String] = None,
    storageLevel: StorageLevel = MEMORY_AND_DISK)
```
`cacheQuery` is the API to cache the data produced by the logical
representation of the given `Dataset`. The field `tableName` is used for
display, but it also effectively serves as a unique identifier, whose
uniqueness is enforced by the Catalog. Let us look at the APIs that use the
field `tableName`.
```Scala
override def cacheTable(tableName: String): Unit = {
  sparkSession.sharedState.cacheManager.cacheQuery(
    sparkSession.table(tableName), Some(tableName))
}

override def uncacheTable(tableName: String): Unit = {
  sparkSession.sharedState.cacheManager.uncacheQuery(
    query = sparkSession.table(tableName))
}
```
When users call `cacheTable`, we cache the result set of the logical
plan of the table `tableName`. `tableName` is displayed in the Storage tab as
the RDD Name, and users can use the same name to look up the logical plan and
uncache the data.
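As a minimal sketch of that round trip (assuming Spark SQL is on the
classpath; the object name `CacheByNameSketch` and the view name `numbers` are
made up for illustration), the same name is used both to cache and to uncache:

```Scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper, not part of Spark: it only calls the public Catalog API.
object CacheByNameSketch {
  // Returns (isCached after cacheTable, isCached after uncacheTable).
  def roundTrip(spark: SparkSession): (Boolean, Boolean) = {
    // Register a small Dataset under a name the Catalog can resolve.
    spark.range(10).createOrReplaceTempView("numbers")

    // Cache by name: the Catalog resolves "numbers" to a logical plan.
    spark.catalog.cacheTable("numbers")
    val afterCache = spark.catalog.isCached("numbers")

    // Uncache by the same name: it is resolved to the same plan again.
    spark.catalog.uncacheTable("numbers")
    val afterUncache = spark.catalog.isCached("numbers")

    (afterCache, afterUncache)
  }
}
```

The point is only that the name is resolvable through the Catalog in both
directions, which is what makes it safe to show as an identifier.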
This PR tries to use the same field `tableName` purely for display, so users
cannot use that name to uncache the data. This looks confusing to me. I
am not sure whether others have the same concern.
IMO, a DataFrame/Dataset is like a temporary view. When we assign a name to a
DataFrame or Dataset, we basically create a named view. If we cache it, we are
putting it into a cross-session memory cache.
If we really need to improve usability, I think we might be able to
register the DataFrame/Dataset as a global temporary view, or improve the
existing cache management with something similar to global view management.
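A rough sketch of the global temporary view idea, using only existing public
APIs rather than anything this PR adds. The helper `cacheUnderName` is
hypothetical, and `global_temp` is assumed to be the global temp database
(the default value of `spark.sql.globalTempDatabase`):

```Scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper for illustration only; not an API proposed in the PR.
object GlobalTempViewCacheSketch {
  // Registers `df` as a global temporary view and caches it by that name.
  // Returns the qualified name that can later be passed to uncacheTable.
  def cacheUnderName(spark: SparkSession, df: DataFrame, viewName: String): String = {
    // Fails with an AnalysisException if a view with this name already exists.
    df.createGlobalTempView(viewName)

    // Assumption: the global temp database keeps its default name.
    val qualified = s"global_temp.$viewName"

    // The name is resolvable by the Catalog, so it can also be used to uncache.
    spark.catalog.cacheTable(qualified)
    qualified
  }
}
```

With this approach, `spark.catalog.uncacheTable("global_temp.<name>")` would
undo the caching by the same name, so the displayed name and the identifier
would stay consistent.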