Sweet! It's here: https://issues.apache.org/jira/browse/SPARK-9141?focusedCommentId=14649437&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14649437
On Tue, Jul 28, 2015 at 11:21 PM Michael Armbrust <mich...@databricks.com> wrote:

> Can you add your description of the problem as a comment to that ticket
> and we'll make sure to test both cases and break it out if the root cause
> ends up being different.
>
> On Tue, Jul 28, 2015 at 2:48 PM, Justin Uang <justin.u...@gmail.com>
> wrote:
>
>> Sweet! Does this cover DataFrame#rdd also using the cached query from
>> DataFrame#cache? I think the ticket 9141 is mainly concerned with whether a
>> derived DataFrame (B) of a cached DataFrame (A) uses the cached query of A,
>> not whether the rdd from A.rdd or B.rdd uses the cached query of A.
>>
>> On Tue, Jul 28, 2015 at 11:33 PM Joseph Bradley <jos...@databricks.com>
>> wrote:
>>
>>> Thanks for bringing this up! I talked with Michael Armbrust, and it
>>> sounds like this is from a bug in DataFrame caching:
>>> https://issues.apache.org/jira/browse/SPARK-9141
>>> It's marked as a blocker for 1.5.
>>> Joseph
>>>
>>> On Tue, Jul 28, 2015 at 2:36 AM, Justin Uang <justin.u...@gmail.com>
>>> wrote:
>>>
>>>> Hey guys,
>>>>
>>>> I'm running into some pretty bad performance issues when it comes to
>>>> using a CrossValidator, because of the caching behavior of DataFrames.
>>>>
>>>> The root of the problem is that while I have cached my DataFrame
>>>> representing the features and labels, it is caching at the DataFrame level,
>>>> while CrossValidator/LogisticRegression both drop down to the dataset.rdd
>>>> level, which ignores the caching that I have previously done. This is
>>>> worsened by the fact that for each combination of a fold and a param set
>>>> from the grid, it recomputes my entire input dataset because the caching
>>>> was lost.
>>>>
>>>> My current solution is to force the input DataFrame to be based off of
>>>> a cached RDD, which I did with this horrible hack (had to drop down to Java
>>>> from PySpark because of something to do with vectors not being inferred
>>>> correctly):
>>>>
>>>> def checkpoint_dataframe_caching(df):
>>>>     # Rebuild the DataFrame on top of a cached Java RDD, reusing df's schema.
>>>>     return DataFrame(
>>>>         sqlContext._ssql_ctx.createDataFrame(df._jdf.rdd().cache(),
>>>>                                              df._jdf.schema()),
>>>>         sqlContext)
>>>>
>>>> before I pass it into the CrossValidator.fit(). If I do this, I still
>>>> have to cache the underlying rdd once more than necessary (in addition to
>>>> DataFrame#cache()), but at least in cross validation, it doesn't recompute
>>>> the RDD graph anymore.
>>>>
>>>> Note that input_df.rdd.cache() doesn't work because the Python
>>>> CrossValidator implementation applies some more DataFrame transformations
>>>> like filter, which then causes filtered_df.rdd to return a completely
>>>> different rdd that recomputes the entire graph.
>>>>
>>>> Is it the intention of Spark SQL that calling DataFrame#rdd removes any
>>>> caching that was done for the query? Is the fix as simple as getting the
>>>> DataFrame#rdd to reference the cached query, or is there something more
>>>> subtle going on?
>>>>
>>>> Best,
>>>>
>>>> Justin
>>>>
>>>
>>>
>
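
For anyone skimming the thread, the cost described in the original message (one full recomputation of the input per fold and param set) can be illustrated with a toy model. This is only a sketch in plain Python, not Spark code: `LazyDataset` and `count_source_runs` are hypothetical names made up for the illustration, and the fold and grid sizes are arbitrary assumptions.

```python
class LazyDataset:
    """Toy lazy dataset: recomputes its lineage on every collect() unless cached."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function that produces the rows
        self._use_cache = False    # set by cache()
        self._cached = None        # rows, once materialized

    def cache(self):
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = self._compute()
            return self._cached
        return self._compute()

    def filter(self, pred):
        # The derived dataset reads its parent through collect(), so it
        # benefits from the parent's cache only if that cache is honored.
        return LazyDataset(lambda: [r for r in self.collect() if pred(r)])


def count_source_runs(cache_input, num_folds=3, num_param_sets=4):
    """Count how often the expensive source runs during a mock cross validation."""
    runs = 0

    def expensive_source():
        nonlocal runs
        runs += 1
        return list(range(30))

    data = LazyDataset(expensive_source)
    if cache_input:
        data.cache()

    for fold in range(num_folds):
        # Hold out one fold, as a cross validator would.
        train = data.filter(lambda r, f=fold: r % num_folds != f)
        for _ in range(num_param_sets):
            train.collect()  # stands in for one model fit
    return runs
```

With these assumed sizes, `count_source_runs(False)` returns 12 (3 folds times 4 param sets), which mirrors the situation where the cache is ignored, while `count_source_runs(True)` returns 1, which is what one would hope for once the cached query is actually reused.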