[https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15923158#comment-15923158]
yuhao yang commented on SPARK-18608:
------------------------------------
Thanks [~podongfeng], I'd say it's a better solution as it avoids an API change. That
said, it should only be a temporary workaround until, in the long term, we migrate
all the implementations from RDD to DataFrame.
Also FYI, Nick mentioned something related
[here|https://issues.apache.org/jira/browse/SPARK-19071?focusedCommentId=15834232&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15834232].
I think the new solution can be adapted to that with a setter method; a rough
sketch is below. For now, we can just focus on resolving the double-caching issue.
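For reference, here is a rough sketch of the pattern I have in mind, assuming we keep the internal-caching behaviour; the class name and the {{setHandlePersistence}} setter below are only illustrative, not an existing API:
{code}
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

// Illustrative sketch only: check the Dataset's own storage level rather than
// dataset.rdd.getStorageLevel, and let callers opt out of internal caching.
class ExampleAlgorithm {
  private var handlePersistence: Boolean = true

  // Hypothetical setter for callers that manage persistence themselves.
  def setHandlePersistence(value: Boolean): this.type = {
    handlePersistence = value
    this
  }

  def fit[T](dataset: Dataset[T]): Unit = {
    // Persist only if asked to and the input is not already cached.
    val shouldCache = handlePersistence && dataset.storageLevel == StorageLevel.NONE
    if (shouldCache) dataset.persist(StorageLevel.MEMORY_AND_DISK)
    try {
      // ... iterative training over `dataset` ...
    } finally {
      if (shouldCache) dataset.unpersist()
    }
  }
}
{code}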
> Spark ML algorithms that check RDD cache level for internal caching
> double-cache data
> -------------------------------------------------------------------------------------
>
> Key: SPARK-18608
> URL: https://issues.apache.org/jira/browse/SPARK-18608
> Project: Spark
> Issue Type: Bug
> Components: ML
> Reporter: Nick Pentreath
>
> Some algorithms in Spark ML (e.g. {{LogisticRegression}},
> {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence
> internally. They check whether the input dataset is cached, and if not they
> cache it for performance.
> However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}.
> This will actually always be true, since even if the dataset itself is
> cached, the RDD returned by {{dataset.rdd}} will not be cached.
> Hence if the input dataset is cached, the data will end up being cached
> twice, which is wasteful.
> To see this:
> {code}
> scala> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.storage.StorageLevel
> scala> val df = spark.range(10).toDF("num")
> df: org.apache.spark.sql.DataFrame = [num: bigint]
> scala> df.storageLevel == StorageLevel.NONE
> res0: Boolean = true
> scala> df.persist
> res1: df.type = [num: bigint]
> scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
> res2: Boolean = true
> scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
> res3: Boolean = false
> scala> df.rdd.getStorageLevel == StorageLevel.NONE
> res4: Boolean = true
> {code}
> Before SPARK-16063, there was no way to check the storage level of the input
> {{Dataset}}, but now we can, so the checks should be migrated to use
> {{dataset.storageLevel}}.
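> A minimal illustration of the intended change (the helper below is only a sketch, not existing code):
> {code}
> import org.apache.spark.sql.Dataset
> import org.apache.spark.storage.StorageLevel
>
> // Sketch: decide on internal caching from the Dataset's own storage level.
> // Old check (always true): dataset.rdd.getStorageLevel == StorageLevel.NONE
> def trainWithCaching[T](dataset: Dataset[T])(train: Dataset[T] => Unit): Unit = {
>   val handlePersistence = dataset.storageLevel == StorageLevel.NONE
>   if (handlePersistence) dataset.persist(StorageLevel.MEMORY_AND_DISK)
>   try train(dataset) finally {
>     if (handlePersistence) dataset.unpersist()
>   }
> }
> {code}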