Re: DataFrame#rdd doesn't respect DataFrame#cache, slowing down CrossValidator

Justin Uang Fri, 31 Jul 2015 09:33:06 -0700

Sweet! It's here:
https://issues.apache.org/jira/browse/SPARK-9141?focusedCommentId=14649437&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14649437


On Tue, Jul 28, 2015 at 11:21 PM Michael Armbrust <mich...@databricks.com>
wrote:

> Can you add your description of the problem as a comment to that ticket
> and we'll make sure to test both cases and break it out if the root cause
> ends up being different.
>
> On Tue, Jul 28, 2015 at 2:48 PM, Justin Uang <justin.u...@gmail.com>
> wrote:
>
>> Sweet! Does this cover DataFrame#rdd also using the cached query from
>> DataFrame#cache? I think SPARK-9141 is mainly concerned with whether a
>> derived DataFrame (B) of a cached DataFrame (A) uses the cached query of A,
>> not whether the rdd from A.rdd or B.rdd uses the cached query of A.
>>
>> On Tue, Jul 28, 2015 at 11:33 PM Joseph Bradley <jos...@databricks.com>
>> wrote:
>>
>>> Thanks for bringing this up! I talked with Michael Armbrust, and it
>>> sounds like this is from a bug in DataFrame caching:
>>> https://issues.apache.org/jira/browse/SPARK-9141
>>> It's marked as a blocker for 1.5.
>>> Joseph
>>>
>>> On Tue, Jul 28, 2015 at 2:36 AM, Justin Uang <justin.u...@gmail.com>
>>> wrote:
>>>
>>>> Hey guys,
>>>>
>>>> I'm running into some pretty bad performance issues when using a
>>>> CrossValidator, because of the caching behavior of DataFrames.
>>>>
>>>> The root of the problem is that while I have cached my DataFrame
>>>> representing the features and labels, it is caching at the DataFrame level,
>>>> while CrossValidator/LogisticRegression both drop down to the dataset.rdd
>>>> level, which ignores the caching that I have previously done. This is
>>>> worsened by the fact that for each combination of a fold and a param set
>>>> from the grid, it recomputes my entire input dataset because the caching
>>>> was lost.
>>>>
>>>> My current solution is to force the input DataFrame to be based off of
>>>> a cached RDD, which I did with this horrible hack (I had to drop down to
>>>> Java from PySpark because of something to do with vectors not being
>>>> inferred correctly):
>>>>
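>>>> from pyspark.sql import DataFrame
>>>>
>>>> def checkpoint_dataframe_caching(df):
>>>>     # Rebuild the DataFrame on top of a cached JVM RDD, reusing df's schema.
>>>>     jdf = sqlContext._ssql_ctx.createDataFrame(df._jdf.rdd().cache(),
>>>>                                                df._jdf.schema())
>>>>     return DataFrame(jdf, sqlContext)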
>>>>
>>>> before I pass it into the CrossValidator.fit(). If I do this, I still
>>>> have to cache the underlying rdd once more than necessary (in addition to
>>>> DataFrame#cache()), but at least in cross validation, it doesn't recompute
>>>> the RDD graph anymore.
>>>>
>>>> Note that input_df.rdd.cache() doesn't work, because the Python
>>>> CrossValidator implementation applies some more DataFrame transformations
>>>> like filter, which then causes filtered_df.rdd to return a completely
>>>> different RDD that recomputes the entire graph.
>>>>
>>>> Is it the intention of Spark SQL that calling DataFrame#rdd removes any
>>>> caching that was done for the query? Is the fix as simple as getting the
>>>> DataFrame#rdd to reference the cached query, or is there something more
>>>> subtle going on?
>>>>
>>>> Best,
>>>>
>>>> Justin
>>>>
>>>
>>>
>
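
The recomputation cost described in the original message can be illustrated with a small, Spark-free sketch. Everything here is a hypothetical stand-in: `LazyDataset` is a toy class, not a Spark API, and the grid keys are only illustrative names. It assumes, as the message describes, that the input is evaluated once per combination of fold and param set, and it simply counts how many times the full lineage runs with and without an effective cache.

```python
from itertools import product


class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not a Spark API)."""

    def __init__(self, rows, cached=False):
        self._rows = list(rows)
        self._use_cache = cached
        self._materialized = None
        self.compute_count = 0  # number of times the full lineage was evaluated

    def collect(self):
        # Serve from the cache when one is enabled and already populated.
        if self._use_cache and self._materialized is not None:
            return self._materialized
        self.compute_count += 1
        result = [r * 2 for r in self._rows]  # stands in for an expensive pipeline
        if self._use_cache:
            self._materialized = result
        return result


def cross_validate(dataset, num_folds, param_grid):
    """Evaluate the dataset once per (fold, param set) and return the fit count."""
    keys = sorted(param_grid)
    combos = list(product(*(param_grid[k] for k in keys)))
    for fold in range(num_folds):
        for _combo in combos:
            rows = dataset.collect()
            # A real implementation would fit a model on the training split here.
            _train = [r for i, r in enumerate(rows) if i % num_folds != fold]
    return num_folds * len(combos)


# Illustrative grid: 3 * 2 = 6 param sets, evaluated over 3 folds.
grid = {"regParam": [0.01, 0.1, 1.0], "maxIter": [10, 50]}
uncached = LazyDataset(range(100))
cached = LazyDataset(range(100), cached=True)

fits = cross_validate(uncached, num_folds=3, param_grid=grid)
cross_validate(cached, num_folds=3, param_grid=grid)

print("fits:", fits)
print("lineage evaluations without cache:", uncached.compute_count)
print("lineage evaluations with cache:", cached.compute_count)
```

In this toy model the uncached input is evaluated once per fit, while the cached one is evaluated a single time, which is the gap the workaround in the thread is trying to close.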
