[ https://issues.apache.org/jira/browse/SPARK-14037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205790#comment-15205790 ]
Sun Rui commented on SPARK-14037: --------------------------------- If possible, just use read.df() to load a DataFrame from a CSV file. Loading a CSV file into a local R data.frame and calling createDataFrame() on it to create a DataFrame is more time-consuming because it involves launching of external R processes on worker nodes and two rounds of data serialization/deserialization. 30 seconds is really slow, could you help to get metrics information? Since you are running on standalone mode, you can goto the web UI and find something like below in the worker stderr logs: ``` INFO r.RRDD: Times: boot = 0.518 s, init = 0.009 s, broadcast = 0.000 s, read-input = 0.001 s, compute = 0.002 s, write-output = 0.074 s, total = 0.604 s ``` > count(df) is very slow for dataframe constrcuted using SparkR::createDataFrame > ------------------------------------------------------------------------------ > > Key: SPARK-14037 > URL: https://issues.apache.org/jira/browse/SPARK-14037 > Project: Spark > Issue Type: Bug > Components: SparkR > Affects Versions: 1.6.1 > Environment: Ubuntu 12.04 > RAM : 6 GB > Spark 1.6.1 Standalone > Reporter: Samuel Alexander > Labels: performance, sparkR > > Any operations on dataframe created using SparkR::createDataFrame is very > slow. > I have a CSV of size ~ 6MB. Below is the sample content > 12121212Juej1XC,A_String,5460.8,2016-03-14,7,Quarter > 12121212K6sZ1XS,A_String,0.0,2016-03-14,7,Quarter > 12121212K9Xc1XK,A_String,7803.0,2016-03-14,7,Quarter > 12121212ljXE1XY,A_String,226944.25,2016-03-14,7,Quarter > 12121212lr8p1XA,A_String,368022.26,2016-03-14,7,Quarter > 12121212lwip1XA,A_String,84091.0,2016-03-14,7,Quarter > 12121212lwkn1XA,A_String,54154.0,2016-03-14,7,Quarter > 12121212lwlv1XA,A_String,11219.09,2016-03-14,7,Quarter > 12121212lwmL1XQ,A_String,23808.0,2016-03-14,7,Quarter > 12121212lwnj1XA,A_String,32029.3,2016-03-14,7,Quarter > I created R data.frame using r_df <- read.csv(file="r_df.csv", head=TRUE, > sep=","). And then converted into Spark dataframe using sp_df <- > createDataFrame(sqlContext, r_df) > Now count(sp_df) took more than 30 seconds > When I load the same CSV using spark-csv like, direct_df <- > read.df(sqlContext, "/home/sam/tmp/csv/orig_content.csv", source = > "com.databricks.spark.csv", inferSchema = "false", header="true") > count(direct_df) took below 1 sec. > I know performance has been improved in createDataFrame in Spark 1.6. But > other operations like count(), is very slow. > How can I get rid of this performance issue? -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org