[https://issues.apache.org/jira/browse/SPARK-13346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15283052#comment-15283052]
Joseph K. Bradley commented on SPARK-13346:
-------------------------------------------
Sure, the practical applications are essentially every MLlib and GraphX
algorithm. To move any of those implementations to run on top of DataFrames,
we will need this fixed.
For a concrete case with executable code, check out the BeliefPropagation
example here:
[https://github.com/graphframes/graphframes/blob/ac4a7c82dbde6529c98b3249a262cb958adaac43/src/main/scala/org/graphframes/examples/BeliefPropagation.scala]
The example uses a hack, {{getCachedDataFrame}}, which converts the current
iteration's DataFrame to an RDD, caches it, and converts it back to a
DataFrame. Without this workaround, the BP example dies after ~3 iterations
even on a tiny example graph, so the failure should be easy to reproduce.
Let me know if I can be of help in exploring the failures; I have other code
snippets too.
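For reference, the workaround described above amounts to something like the
following. This is a minimal sketch, not the exact GraphFrames code; it
assumes a {{SQLContext}} is passed in explicitly, per the Spark 1.6/2.0-era
API:

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

// Sketch of the lineage-truncation hack: round-tripping through an RDD
// discards the accumulated query plan, so the next iteration starts from
// a flat plan over the cached rows rather than the full plan history.
def getCachedDataFrame(df: DataFrame, sqlContext: SQLContext): DataFrame = {
  val rdd = df.rdd.cache()                    // materialize rows as a cached RDD
  sqlContext.createDataFrame(rdd, df.schema)  // rebuild a DataFrame with a fresh plan
}
```

In the iterative loop, the end of each iteration then wraps the new DataFrame
in this call, which keeps the plan size roughly constant across iterations
instead of letting it grow.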
> Using DataFrames iteratively leads to massive query plans, which slows execution
> --------------------------------------------------------------------------------
>
> Key: SPARK-13346
> URL: https://issues.apache.org/jira/browse/SPARK-13346
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Joseph K. Bradley
>
> I have an iterative algorithm based on DataFrames, and the query plan grows
> very quickly with each iteration. Caching the current DataFrame at the end
> of an iteration does not fix the problem. However, converting the DataFrame
> to an RDD and back at the end of each iteration does fix the problem.
> Printing the query plans shows that the plan explodes quickly (10 lines, to
> several hundred lines, to several thousand lines, ...) with successive
> iterations.
> The desired behavior is for the analyzer to recognize that a big chunk of the
> query plan does not need to be computed since it is already cached. The
> computation on each iteration should be the same.
> If useful, I can push (complex) code to reproduce the issue. But it should
> be simple to see if you create an iterative algorithm which produces a new
> DataFrame from an old one on each iteration.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)