[
https://issues.apache.org/jira/browse/SPARK-16169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15346274#comment-15346274
]
Sean Owen commented on SPARK-16169:
-----------------------------------
If you're saving more data each time, this would make sense. It's not clear
that you aren't doing exactly that, perhaps by keeping references to past RDDs
of data. Your jobs have more data and more tasks each time, so something in
your code is asking to do more work.
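To illustrate the failure mode described above, here is a minimal, hypothetical sketch (not the reporter's actual code) of releasing each intermediate RDD once the next one is materialized, so old lineages and cached blocks don't accumulate across iterations. The object name and the `local[*]` master are assumptions for a self-contained example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ReleaseOldStages {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("release-old-stages").setMaster("local[*]"))

    // Materialize the first stage and cache it.
    var current = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_ONLY)
    current.count()

    for (step <- 1 to 5) {
      val next = current.map(_ + step).persist(StorageLevel.MEMORY_ONLY)
      next.count()          // materialize the new stage first
      current.unpersist()   // then release the old one; keep no reference to it
      current = next        // only the latest stage stays reachable
    }

    sc.stop()
  }
}
```

If instead every `current` were kept in a collection, each job would carry the full chain of prior transformations and the per-iteration work would grow, matching the symptom reported here.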
> Saving intermediate DataFrame increases processing time up to 5 times.
> ------------------------------------------------------------------------
>
> Key: SPARK-16169
> URL: https://issues.apache.org/jira/browse/SPARK-16169
> Project: Spark
> Issue Type: Question
> Components: Spark Submit, Web UI
> Affects Versions: 1.6.1
> Environment: Amazon EMR
> Reporter: Manish Kumar
> Labels: performance
> Attachments: Spark-UI.png
>
>
> When a Spark application (written in Scala) saves an intermediate
> DataFrame, its processing time increases almost 5-fold.
> Although the Spark UI clearly shows that all jobs are completed, the
> application remains in running status.
> Below is the code that saves the intermediate output and then reuses the
> DataFrame.
> {noformat}
> saveDataFrame(flushPath, flushFormat, isCoalesce, flushMode,
> previousDataFrame, sqlContext)
> previousDataFrame.count
> {noformat}
> Here, previousDataFrame is the result of the previous step, and
> saveDataFrame simply writes the DataFrame to the given location;
> previousDataFrame is then used by the subsequent steps/transformations.
> Below is a Spark UI screenshot showing jobs marked as completed even though
> some tasks inside them are neither completed nor skipped.
> !Spark-UI.png!
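A hedged sketch of an alternative to the pattern quoted above: after flushing the intermediate result, read it back and use the reloaded DataFrame downstream, so later stages start from a short "scan parquet" plan instead of the full chain of prior transformations. This assumes the Spark 1.6 API (`SQLContext`); the path, object name, and `local[*]` master are illustrative, and `saveDataFrame` is stood in for by a plain Parquet write.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object FlushAndReload {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("flush-and-reload").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val flushPath = "/tmp/intermediate-step"  // illustrative location
    val previousDataFrame = sc.parallelize(1 to 100).toDF("value")

    // Equivalent of saveDataFrame(...): write the intermediate result out.
    previousDataFrame.write.mode("overwrite").parquet(flushPath)

    // Read it back; this DataFrame's plan no longer contains the prior steps.
    val reloaded = sqlContext.read.parquet(flushPath)
    reloaded.count()  // subsequent steps should operate on `reloaded`

    sc.stop()
  }
}
```

Whether this helps here depends on whether the slowdown really comes from re-evaluating a long lineage, which is what Sean's comment suggests checking first.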
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]