[
https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15907038#comment-15907038
]
Sean Owen commented on SPARK-18608:
-----------------------------------
Is the point here that the .ml implementation can check the storage level of
its input and pass that information, internally, to the .mllib
implementation that it delegates to? Yes, I think that's the most feasible
answer to this particular problem, as long as it doesn't change a public API.
The .mllib implementations would need some internal mechanism for receiving
information about the parent's storage level, where applicable.
It does raise the more general question of whether these implementations can
meaningfully make caching decisions at all, and whether they should try rather
than just warn. More generally, it's hard for any library function to reason
about whether to persist its input, or even about when a data structure can
safely be unpersisted. Those are much bigger, separate questions, but they are
why this type of issue keeps popping up and is hard to solve.
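For concreteness, here is a rough sketch of that idea, assuming a hypothetical
{{handlePersistence}} flag that the .ml wrapper computes from
{{dataset.storageLevel}} and passes down to the internal routine (the real
method signatures in Spark differ):
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.storage.StorageLevel

// .ml side: decide whether internal caching is needed from the Dataset itself.
def fit(dataset: Dataset[Row]): Unit = {
  val handlePersistence = dataset.storageLevel == StorageLevel.NONE
  val instances: RDD[Row] = dataset.rdd
  runInternal(instances, handlePersistence)
}

// .mllib-style side: persist/unpersist only when the caller's input was uncached.
def runInternal(instances: RDD[Row], handlePersistence: Boolean): Unit = {
  if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
  try {
    // ... actual training over `instances` ...
  } finally {
    if (handlePersistence) instances.unpersist()
  }
}
{code}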
> Spark ML algorithms that check RDD cache level for internal caching
> double-cache data
> -------------------------------------------------------------------------------------
>
> Key: SPARK-18608
> URL: https://issues.apache.org/jira/browse/SPARK-18608
> Project: Spark
> Issue Type: Bug
> Components: ML
> Reporter: Nick Pentreath
>
> Some algorithms in Spark ML (e.g. {{LogisticRegression}},
> {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence
> internally. They check whether the input dataset is cached, and if not they
> cache it for performance.
> However, the check is done using {{dataset.rdd.getStorageLevel == StorageLevel.NONE}}.
> This will actually always be true, since even if the dataset itself is
> cached, the RDD returned by {{dataset.rdd}} will not be cached.
> Hence if the input dataset is cached, the data will end up being cached
> twice, which is wasteful.
> To see this:
> {code}
> scala> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.storage.StorageLevel
> scala> val df = spark.range(10).toDF("num")
> df: org.apache.spark.sql.DataFrame = [num: bigint]
> scala> df.storageLevel == StorageLevel.NONE
> res0: Boolean = true
> scala> df.persist
> res1: df.type = [num: bigint]
> scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
> res2: Boolean = true
> scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
> res3: Boolean = false
> scala> df.rdd.getStorageLevel == StorageLevel.NONE
> res4: Boolean = true
> {code}
> Before SPARK-16063, there was no way to check the storage level of the input
> {{Dataset}}, but now we can, so the checks should be migrated to use
> {{dataset.storageLevel}}.
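A minimal sketch of the migration the description calls for, using the
{{Dataset.storageLevel}} API from SPARK-16063 (the method name here is
illustrative, not the actual Spark ML code):
{code}
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.storage.StorageLevel

def needsInternalCaching(dataset: Dataset[Row]): Boolean = {
  // Old check: effectively always true, because dataset.rdd produces a new,
  // uncached RDD even when the Dataset itself is persisted:
  //   dataset.rdd.getStorageLevel == StorageLevel.NONE

  // New check: Dataset.storageLevel reports the Dataset's own storage level,
  // so an already-persisted input is detected correctly.
  dataset.storageLevel == StorageLevel.NONE
}
{code}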