[GitHub] spark issue #16609: [SPARK-8480] [CORE] [PYSPARK] [SPARKR] Add setName for D...

gatorsmile Mon, 30 Jan 2017 10:38:16 -0800

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/16609
  
    If this is for display, I like the change. However, I am afraid it might 
introduce some side effects. 
    
    In this PR, the changes I am concerned about are the following two functions:
    ```Scala
      def persist(): this.type = {
        sparkSession.sharedState.cacheManager.cacheQuery(this, Option(name))
        this
      }
    
      def persist(newLevel: StorageLevel): this.type = {
        sparkSession.sharedState.cacheManager.cacheQuery(this, Option(name), newLevel)
        this
      }
    ```
    
    Before jumping to the new changes, let us see how we are using `cacheQuery`.
    ```Scala
      def cacheQuery(
          query: Dataset[_],
          tableName: Option[String] = None,
          storageLevel: StorageLevel = MEMORY_AND_DISK)
    ```
    
    `cacheQuery` is the API that caches the data produced by the logical representation of the given [[Dataset]]. The field `tableName` is used for display, but it also serves as a unique identifier, whose uniqueness is enforced by the Catalog. The following APIs use the field `tableName`.
    
    ```Scala
      override def cacheTable(tableName: String): Unit = {
        sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), Some(tableName))
      }
    
      override def uncacheTable(tableName: String): Unit = {
        sparkSession.sharedState.cacheManager.uncacheQuery(query = sparkSession.table(tableName))
      }
    ```
    
    When users call `cacheTable`, we cache the result set of the logical plan of the table `tableName`. `tableName` is displayed in the Storage tab as the RDD Name, and users can also use the same name to look up the logical plan and uncache the data.
    
    This PR reuses the same field `tableName` for display, but users cannot use that name to uncache the data. This looks confusing to me. I am not sure whether others have the same concern.
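
    To make the asymmetry concrete, here is a toy model in plain Scala. It is only a sketch: `Plan`, `CacheEntry`, `ToyCatalog` and `ToyCacheManager` are hypothetical stand-ins, not Spark classes, and the real `CacheManager` matches plans differently. It only illustrates that a name can be used to uncache when the Catalog can resolve it to a plan.

    ```Scala
    // Toy model only: every type here is a hypothetical stand-in, not a Spark class.
    final case class Plan(description: String)
    final case class CacheEntry(plan: Plan, displayName: Option[String])

    // Stand-in for the Catalog: resolves a unique table name to its plan.
    final class ToyCatalog {
      private var tables = Map.empty[String, Plan]
      def register(name: String, plan: Plan): Unit = tables += (name -> plan)
      def lookup(name: String): Option[Plan] = tables.get(name)
    }

    // Stand-in for the cache manager: entries are keyed by plan; the name is display-only.
    final class ToyCacheManager {
      private var entries = Vector.empty[CacheEntry]
      def cacheQuery(plan: Plan, tableName: Option[String] = None): Unit =
        if (!entries.exists(_.plan == plan)) entries :+= CacheEntry(plan, tableName)
      def uncacheQuery(plan: Plan): Unit =
        entries = entries.filterNot(_.plan == plan)
      def displayNames: Seq[String] = entries.flatMap(_.displayName)
    }

    object NameAsymmetry {
      // Returns the displayed names before and after trying to uncache by name.
      def run(): (Seq[String], Seq[String]) = {
        val catalog = new ToyCatalog
        val cache = new ToyCacheManager

        // Like cacheTable("t1"): the name is registered in the catalog.
        val tablePlan = Plan("scan t1")
        catalog.register("t1", tablePlan)
        cache.cacheQuery(tablePlan, Some("t1"))

        // Like persist() on a named Dataset: the name exists only in the cache entry.
        val datasetPlan = Plan("filter(scan t2)")
        cache.cacheQuery(datasetPlan, Some("myName"))

        val before = cache.displayNames

        // Uncache by name: only names the catalog can resolve lead to a plan.
        Seq("t1", "myName").foreach { name =>
          catalog.lookup(name).foreach(cache.uncacheQuery)
        }
        (before, cache.displayNames)
      }
    }
    ```

    In this sketch both names show up in the display list, but only the catalog-registered one can be removed by name; the Dataset name stays behind because nothing maps it back to a plan.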
    
    IMO, a DataFrame/Dataset is like a temporary view. When we assign a name to a DataFrame or Dataset, we are basically creating a named view. If we cache it, we are putting it into a cross-session memory cache.
    
    If we really need to improve the usability, I think we might be able to register the DataFrame/Dataset as a global temporary view, or improve the existing cache management with something similar to global view management.
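
    As a rough sketch of the first option, the existing public API already lets a user register a Dataset as a global temporary view and then cache and uncache it by name. This assumes Spark 2.1 or later on the classpath and the default `global_temp` database; the view name `my_cached_ids` and the object name are made up for illustration, and this is not what the PR implements.

    ```Scala
    import org.apache.spark.sql.SparkSession

    object GlobalTempViewCacheSketch {
      // Returns (isCached after cacheTable, isCached after uncacheTable).
      def demo(spark: SparkSession): (Boolean, Boolean) = {
        val df = spark.range(0, 1000).toDF("id")

        // Register the Dataset under a name that the catalog can resolve.
        // "my_cached_ids" is a hypothetical name chosen for this sketch.
        df.createGlobalTempView("my_cached_ids")

        // Assumes the default global temp database name, "global_temp".
        val qualified = "global_temp.my_cached_ids"

        // The same name is used to cache and, later, to uncache.
        spark.catalog.cacheTable(qualified)
        val cachedAfterCache = spark.catalog.isCached(qualified)

        spark.catalog.uncacheTable(qualified)
        val cachedAfterUncache = spark.catalog.isCached(qualified)

        // Clean up the view so the sketch can be re-run in the same application.
        spark.catalog.dropGlobalTempView("my_cached_ids")
        (cachedAfterCache, cachedAfterUncache)
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[1]")
          .appName("global-temp-view-cache-sketch")
          .getOrCreate()
        try println(demo(spark))
        finally spark.stop()
      }
    }
    ```

    Here the name is owned by the catalog, so it works both as the displayed name and as the handle for uncaching, which is the property the `tableName` field has for `cacheTable` today.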


