[ https://issues.apache.org/jira/browse/SPARK-14037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205790#comment-15205790 ]

Sun Rui commented on SPARK-14037:
---------------------------------

If possible, just use read.df() to load a DataFrame from a CSV file. Loading a 
CSV file into a local R data.frame and then calling createDataFrame() on it to 
create a DataFrame is more time-consuming, because it involves launching 
external R processes on the worker nodes and two rounds of data 
serialization/deserialization.
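
For illustration, a minimal sketch of both paths, assuming an existing 
sqlContext and the spark-csv package on the classpath (as in the issue 
description; the file path here is hypothetical):
```
# Slow path: read into a local R data.frame, then ship it to the cluster.
# Evaluating sp_df launches R worker processes and pays two rounds of
# serialization/deserialization, as described above.
r_df <- read.csv(file = "data.csv", header = TRUE, sep = ",")
sp_df <- createDataFrame(sqlContext, r_df)

# Fast path: let Spark read the CSV itself; count() then stays on the JVM side.
direct_df <- read.df(sqlContext, "data.csv",
                     source = "com.databricks.spark.csv",
                     header = "true", inferSchema = "false")
count(direct_df)
```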

30 seconds is really slow; could you help collect some metrics? Since you are 
running in standalone mode, you can go to the web UI and find something like 
the following in the worker stderr logs (the boot figure reflects the R worker 
startup time):
```
INFO r.RRDD: Times: boot = 0.518 s, init = 0.009 s, broadcast = 0.000 s, 
read-input = 0.001 s, compute = 0.002 s, write-output = 0.074 s, total = 0.604 s
```

> count(df) is very slow for dataframe constructed using SparkR::createDataFrame
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-14037
>                 URL: https://issues.apache.org/jira/browse/SPARK-14037
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 1.6.1
>         Environment: Ubuntu 12.04
> RAM : 6 GB
> Spark 1.6.1 Standalone
>            Reporter: Samuel Alexander
>              Labels: performance, sparkR
>
> Any operation on a DataFrame created using SparkR::createDataFrame is very 
> slow.
> I have a CSV of size ~6 MB. Below is a sample of its content:
> 12121212Juej1XC,A_String,5460.8,2016-03-14,7,Quarter
> 12121212K6sZ1XS,A_String,0.0,2016-03-14,7,Quarter
> 12121212K9Xc1XK,A_String,7803.0,2016-03-14,7,Quarter
> 12121212ljXE1XY,A_String,226944.25,2016-03-14,7,Quarter
> 12121212lr8p1XA,A_String,368022.26,2016-03-14,7,Quarter
> 12121212lwip1XA,A_String,84091.0,2016-03-14,7,Quarter
> 12121212lwkn1XA,A_String,54154.0,2016-03-14,7,Quarter
> 12121212lwlv1XA,A_String,11219.09,2016-03-14,7,Quarter
> 12121212lwmL1XQ,A_String,23808.0,2016-03-14,7,Quarter
> 12121212lwnj1XA,A_String,32029.3,2016-03-14,7,Quarter
> I created an R data.frame using r_df <- read.csv(file = "r_df.csv", 
> header = TRUE, sep = ",") and then converted it into a Spark DataFrame using 
> sp_df <- createDataFrame(sqlContext, r_df).
> count(sp_df) then took more than 30 seconds.
> When I load the same CSV directly via spark-csv, direct_df <- 
> read.df(sqlContext, "/home/sam/tmp/csv/orig_content.csv", source = 
> "com.databricks.spark.csv", inferSchema = "false", header = "true"),
> count(direct_df) takes below 1 second.
> I know performance of createDataFrame was improved in Spark 1.6, but other 
> operations such as count() are still very slow.
> How can I get rid of this performance issue? 


