[
https://issues.apache.org/jira/browse/SPARK-14031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206302#comment-15206302
]
Vincent Ohprecio edited comment on SPARK-14031 at 3/22/16 1:02 PM:
-------------------------------------------------------------------
The code is an example from the Apache Spark docs that uses the standard CSV
package from Databricks, run against a larger CSV file. The package is found here:
`https://github.com/databricks/spark-csv`
Full code to reproduce here:
`import org.apache.spark.sql.SQLContext
// sc is the SparkContext provided by spark-shell
val sqlContext = new SQLContext(sc)
// Read the CSV, treating the first line as column names
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("/Users/employee/Downloads/2008.csv")
// Keep two columns and write them back out as CSV (the slow step)
val selectedData = df.select("Year", "Cancelled")
selectedData.write.format("com.databricks.spark.csv").option("header", "true").save("output.csv")
`
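To compare the CSV and Parquet write times outside the Spark UI, the write calls could be wrapped in a small timing helper. The following is a minimal sketch and not part of the original report: `timed` is a hypothetical helper name, and the commented usage assumes the `selectedData` DataFrame from the code above (the `output.parquet` path is made up for illustration).

```scala
// Hypothetical helper: runs a block, prints and returns the elapsed wall-clock time.
def timed[A](label: String)(body: => A): (A, Long) = {
  val start = System.nanoTime()
  val result = body                      // evaluate the by-name block exactly once
  val elapsedNs = System.nanoTime() - start
  println(f"$label: ${elapsedNs / 1e9}%.1f s")
  (result, elapsedNs)
}

// Usage in spark-shell (assumes selectedData from the reproduction code):
// timed("csv write") {
//   selectedData.write.format("com.databricks.spark.csv").option("header", "true").save("output.csv")
// }
// timed("parquet write") {
//   selectedData.write.parquet("output.parquet")
// }
```

Wall-clock time measured this way includes job scheduling, so it will not match the stage elapsed time from the Spark UI exactly, but it should be close enough to tell seconds from an hour.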
> Dataframe to csv IO, system performance enters high CPU state and write
> operation takes 1 hour to complete
> ----------------------------------------------------------------------------------------------------------
>
> Key: SPARK-14031
> URL: https://issues.apache.org/jira/browse/SPARK-14031
> Project: Spark
> Issue Type: Bug
> Components: Spark Shell
> Affects Versions: 2.0.0
> Environment: Mac OS X 10.11.2, MacBook Pro, 16 GB, 2.2 GHz Intel Core i7,
> 1 TB; and Ubuntu 14.04 (Vagrant), 4 cores, 8 GB
> Reporter: Vincent Ohprecio
> Priority: Minor
> Attachments: visualVMscreenshot.png
>
>
> Summary
> When using spark-assembly-2.0.0/spark-shell to write the results of a
> DataFrame to CSV, the system enters a high-CPU state and the write
> operation takes about 1 hour to complete.
> * Affecting: [Stage 5:> (0 + 2) / 21]
> * Stage 5 elapsed time 3488272270000 ns (about 58 minutes)
> In comparison, tests were conducted using 1.4, 1.5 and 1.6 with the same
> code and data, and Stage 5 CSV write times were between 2 and 22 seconds.
> In addition, Parquet (Stage 3) write tests on 1.4, 1.5, 1.6 and 2.0 were
> similar, between 2 and 22 seconds.
> Files
> 1. Data File is "2008.csv"
> 2. Data file download http://stat-computing.org/dataexpo/2009/the-data.html
> 3. Code https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb
> Observation 1 - Setup
> High CPU and 58 minute average completion time
> * Mac OS X 10.11.2
> * MacBook Pro, 16 GB, 2.2 GHz Intel Core i7, 1 TB
> * spark-assembly-2.0.0
> * spark-csv_2.11-1.4
> * Code: https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb
> Observation 2 - Setup
> High CPU; waited over an hour for the CSV write but did not wait for it to complete
> * Ubuntu 14.04
> * 4 cores, 8 GB
> * spark-assembly-2.0.0
> * spark-csv_2.11-1.4
> Code Output: https://gist.github.com/bigsnarfdude/930f5832c231c3d39651
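
For scale, the reported Stage 5 time can be converted and set against the 2 to 22 second range seen on the older versions. This is only a worked-arithmetic sketch using the figures quoted in the issue; the variable names are illustrative.

```scala
// Figures taken from the issue text above
val stage5ElapsedNs = 3488272270000L   // Stage 5 elapsed time on 2.0.0, in nanoseconds
val fastestOldSec   = 2.0              // fastest CSV write reported on 1.4-1.6
val slowestOldSec   = 22.0             // slowest CSV write reported on 1.4-1.6

val seconds = stage5ElapsedNs / 1e9
val minutes = seconds / 60

// Slowdown relative to the slowest and fastest older runs
val slowdownLow  = seconds / slowestOldSec
val slowdownHigh = seconds / fastestOldSec

println(f"Stage 5: $seconds%.1f s = $minutes%.1f min")
println(f"Slowdown vs 1.4-1.6: $slowdownLow%.0fx to $slowdownHigh%.0fx")
```

That works out to roughly 58 minutes, consistent with the average given under Observation 1, and on the order of a hundred to well over a thousand times slower than the older releases.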