[
https://issues.apache.org/jira/browse/SPARK-16169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15346274#comment-15346274
]
Sean Owen commented on SPARK-16169:
-----------------------------------
If you're saving more data each time, this would make sense. It's not clear
that you aren't doing exactly that, perhaps by keeping references to past RDDs
of data. Your jobs have more data and more tasks each time, so something in
your code is asking to do more work.
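To illustrate the failure mode described above, here is a minimal, hypothetical sketch (not the reporter's actual code) of releasing each intermediate RDD once the next one is materialized, so old lineages and cached blocks don't accumulate across iterations. The object name and the `local[*]` master are assumptions for a self-contained example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ReleaseOldStages {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("release-old-stages").setMaster("local[*]"))

    // Materialize the first stage and cache it.
    var current = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_ONLY)
    current.count()

    for (step <- 1 to 5) {
      val next = current.map(_ + step).persist(StorageLevel.MEMORY_ONLY)
      next.count()          // materialize the new stage first
      current.unpersist()   // then release the old one; keep no reference to it
      current = next        // only the latest stage stays reachable
    }

    sc.stop()
  }
}
```

If instead every `current` were kept in a collection, each job would carry the full chain of prior transformations and the per-iteration work would grow, matching the symptom reported here.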
> Saving intermediate DataFrame increases processing time up to 5 times.
> ------------------------------------------------------------------------
>
> Key: SPARK-16169
> URL: https://issues.apache.org/jira/browse/SPARK-16169
> Project: Spark
> Issue Type: Question
> Components: Spark Submit, Web UI
> Affects Versions: 1.6.1
> Environment: Amazon EMR
> Reporter: Manish Kumar
> Labels: performance
> Attachments: Spark-UI.png
>
>
> When a Spark application (written in Scala) saves an intermediate
> DataFrame, its processing time increases almost 5-fold.
> Although the Spark UI clearly shows that all jobs are completed, the
> application remains in running status.
> Below is the code that saves the intermediate output and then reuses the
> DataFrame.
> {noformat}
> saveDataFrame(flushPath, flushFormat, isCoalesce, flushMode,
> previousDataFrame, sqlContext)
> previousDataFrame.count
> {noformat}
> Here, previousDataFrame is the result of the previous step, and
> saveDataFrame simply writes the DataFrame to the given location;
> previousDataFrame is then used by the subsequent steps/transformations.
> Below is a Spark UI screenshot showing jobs marked as completed even though
> some tasks inside them are neither completed nor skipped.
> !Spark-UI.png!
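A hedged sketch of an alternative to the pattern quoted above: after flushing the intermediate result, read it back and use the reloaded DataFrame downstream, so later stages start from a short "scan parquet" plan instead of the full chain of prior transformations. This assumes the Spark 1.6 API (`SQLContext`); the path, object name, and `local[*]` master are illustrative, and `saveDataFrame` is stood in for by a plain Parquet write.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object FlushAndReload {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("flush-and-reload").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val flushPath = "/tmp/intermediate-step"  // illustrative location
    val previousDataFrame = sc.parallelize(1 to 100).toDF("value")

    // Equivalent of saveDataFrame(...): write the intermediate result out.
    previousDataFrame.write.mode("overwrite").parquet(flushPath)

    // Read it back; this DataFrame's plan no longer contains the prior steps.
    val reloaded = sqlContext.read.parquet(flushPath)
    reloaded.count()  // subsequent steps should operate on `reloaded`

    sc.stop()
  }
}
```

Whether this helps here depends on whether the slowdown really comes from re-evaluating a long lineage, which is what Sean's comment suggests checking first.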
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]