[
https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15907038#comment-15907038
]
Sean Owen commented on SPARK-18608:
-----------------------------------
Is the point here that the .ml implementation can check the storage level of
its input and pass that information, internally, to the .mllib
implementation that it delegates to? Yes, I think that's the most feasible
answer to this particular problem, as long as it doesn't change a public API.
The .mllib implementations would need some internal mechanism for receiving
information about the parent's storage level, where applicable.
It does raise the more general question of whether these implementations can
meaningfully make caching decisions at all, and whether they should try rather
than just warn. More generally, it's hard for any library function to reason
about whether to persist its input, or even about when a data structure can
safely be unpersisted. Those are much bigger, separate questions, but they are
why this type of issue keeps popping up and is hard to solve.
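For concreteness, here is a rough sketch of that idea, assuming a hypothetical
{{handlePersistence}} flag that the .ml wrapper computes from
{{dataset.storageLevel}} and passes down to the internal routine (the real
method signatures in Spark differ):
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.storage.StorageLevel

// .ml side: decide whether internal caching is needed from the Dataset itself.
def fit(dataset: Dataset[Row]): Unit = {
  val handlePersistence = dataset.storageLevel == StorageLevel.NONE
  val instances: RDD[Row] = dataset.rdd
  runInternal(instances, handlePersistence)
}

// .mllib-style side: persist/unpersist only when the caller's input was uncached.
def runInternal(instances: RDD[Row], handlePersistence: Boolean): Unit = {
  if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
  try {
    // ... actual training over `instances` ...
  } finally {
    if (handlePersistence) instances.unpersist()
  }
}
{code}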
> Spark ML algorithms that check RDD cache level for internal caching
> double-cache data
> -------------------------------------------------------------------------------------
>
> Key: SPARK-18608
> URL: https://issues.apache.org/jira/browse/SPARK-18608
> Project: Spark
> Issue Type: Bug
> Components: ML
> Reporter: Nick Pentreath
>
> Some algorithms in Spark ML (e.g. {{LogisticRegression}},
> {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence
> internally. They check whether the input dataset is cached, and if not they
> cache it for performance.
> However, the check is done using {{dataset.rdd.getStorageLevel == StorageLevel.NONE}}.
> This will actually always be true, since even if the dataset itself is
> cached, the RDD returned by {{dataset.rdd}} will not be cached.
> Hence if the input dataset is cached, the data will end up being cached
> twice, which is wasteful.
> To see this:
> {code}
> scala> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.storage.StorageLevel
> scala> val df = spark.range(10).toDF("num")
> df: org.apache.spark.sql.DataFrame = [num: bigint]
> scala> df.storageLevel == StorageLevel.NONE
> res0: Boolean = true
> scala> df.persist
> res1: df.type = [num: bigint]
> scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
> res2: Boolean = true
> scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
> res3: Boolean = false
> scala> df.rdd.getStorageLevel == StorageLevel.NONE
> res4: Boolean = true
> {code}
> Before SPARK-16063, there was no way to check the storage level of the input
> {{Dataset}}, but now we can, so the checks should be migrated to use
> {{dataset.storageLevel}}.
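A minimal sketch of the migration the description calls for, using the
{{Dataset.storageLevel}} API from SPARK-16063 (the method name here is
illustrative, not the actual Spark ML code):
{code}
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.storage.StorageLevel

def needsInternalCaching(dataset: Dataset[Row]): Boolean = {
  // Old check: effectively always true, because dataset.rdd produces a new,
  // uncached RDD even when the Dataset itself is persisted:
  //   dataset.rdd.getStorageLevel == StorageLevel.NONE

  // New check: Dataset.storageLevel reports the Dataset's own storage level,
  // so an already-persisted input is detected correctly.
  dataset.storageLevel == StorageLevel.NONE
}
{code}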