[
https://issues.apache.org/jira/browse/SPARK-13804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195625#comment-15195625
]
Michael Nguyen commented on SPARK-13804:
----------------------------------------
I posted this issue to user@ at
http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-count-Major-Divergent-Non-Linear-Performance-Slowdown-when-data-set-increases-from-4-millis-td26493.html
However, it has not been accepted by the mailing list yet. What needs to be
done for it to be accepted? And what is the typical turnaround time for
postings to be accepted?
> Spark SQL's DataFrame.count() Major Divergent (Non-Linear) Performance
> Slowdown going from 4 million rows to 16+ million rows
> -----------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-13804
> URL: https://issues.apache.org/jira/browse/SPARK-13804
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0
> Environment: - 3-node Spark cluster: 1 master node and 2 slave nodes
> - Each node is an EC2 c3.4xlarge instance
> - Each node has 16 cores and 30 GB of RAM
> Reporter: Michael Nguyen
>
> Spark SQL is used to load CSV files via com.databricks.spark.csv and then run
> dataFrame.count()
> In the same environment, with plenty of CPU and RAM, Spark SQL takes
> - 18.25 seconds to load a table with 4 million rows, vs.
> - 346.624 seconds (5.77 minutes) to load a table with 16 million rows.
> Even though the number of rows increases by 4 times, the time it takes Spark
> SQL to run dataFrame.count() increases by roughly 19 times, so the performance
> of dataFrame.count() degrades drastically.
> 1. Why is Spark SQL's performance not proportional to the number of rows
> when there is plenty of CPU and RAM (the job uses only 10 GB of the 30 GB
> available)?
> 2. What can be done to fix this performance issue?
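The scaling claim in the report can be sanity-checked with simple arithmetic; using the quoted timings (18.25 s for 4 million rows, 346.624 s for 16 million), the slowdown factor comes out near 19x against a 4x data increase:

```python
# Sanity-check the scaling claim using the timings quoted in the report.
rows_small, secs_small = 4_000_000, 18.25
rows_large, secs_large = 16_000_000, 346.624

row_factor = rows_large / rows_small    # 4.0: data grew 4x
time_factor = secs_large / secs_small   # ~19x: runtime grew far faster

# Linear scaling would predict roughly row_factor * secs_small seconds.
predicted_linear = row_factor * secs_small  # 73.0 seconds
print(f"rows grew {row_factor:.0f}x, time grew {time_factor:.1f}x "
      f"(linear scaling would predict ~{predicted_linear:.0f}s, "
      f"observed {secs_large}s)")
```

If count() scaled linearly with row count, the 16-million-row table would be expected to take about 73 seconds rather than the observed 346.624 seconds.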
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]