[ https://issues.apache.org/jira/browse/SPARK-14037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206307#comment-15206307 ]
Samuel Alexander commented on SPARK-14037:
------------------------------------------

I see only the details below in the "Executors" tab.

    Executor ID:     driver
    Address:         localhost:33411
    RDD Blocks:      0
    Storage Memory:  0.0 B / 511.1 MB
    Disk Used:       0.0 B
    Active Tasks:    0
    Failed Tasks:    0
    Complete Tasks:  1
    Total Tasks:     1
    Task Time:       528 ms
    Input:           0.0 B
    Shuffle Read:    0.0 B
    Shuffle Write:   0.0 B
    Thread Dump:     Thread Dump

Do you need the thread dump? I can't find an executor summary in the Spark UI. Do I need to enable it by passing some argument to the SparkR shell?

> count(df) is very slow for dataframe constrcuted using SparkR::createDataFrame
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-14037
>                 URL: https://issues.apache.org/jira/browse/SPARK-14037
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 1.6.1
>         Environment: Ubuntu 12.04
> RAM : 6 GB
> Spark 1.6.1 Standalone
>            Reporter: Samuel Alexander
>              Labels: performance, sparkR
>
> Any operation on a dataframe created using SparkR::createDataFrame is very
> slow.
> I have a CSV of size ~6 MB. Below is a sample of its content:
> 12121212Juej1XC,A_String,5460.8,2016-03-14,7,Quarter
> 12121212K6sZ1XS,A_String,0.0,2016-03-14,7,Quarter
> 12121212K9Xc1XK,A_String,7803.0,2016-03-14,7,Quarter
> 12121212ljXE1XY,A_String,226944.25,2016-03-14,7,Quarter
> 12121212lr8p1XA,A_String,368022.26,2016-03-14,7,Quarter
> 12121212lwip1XA,A_String,84091.0,2016-03-14,7,Quarter
> 12121212lwkn1XA,A_String,54154.0,2016-03-14,7,Quarter
> 12121212lwlv1XA,A_String,11219.09,2016-03-14,7,Quarter
> 12121212lwmL1XQ,A_String,23808.0,2016-03-14,7,Quarter
> 12121212lwnj1XA,A_String,32029.3,2016-03-14,7,Quarter
> I created an R data.frame using r_df <- read.csv(file="r_df.csv", head=TRUE, sep=",").
> I then converted it into a Spark dataframe using sp_df <- createDataFrame(sqlContext, r_df).
> After that, count(sp_df) took more than 30 seconds.
> When I load the same CSV using spark-csv, like
> direct_df <- read.df(sqlContext, "/home/sam/tmp/csv/orig_content.csv", source = "com.databricks.spark.csv", inferSchema = "false", header="true")
> count(direct_df) took less than 1 second.
> I know createDataFrame performance was improved in Spark 1.6, but
> other operations such as count() are very slow.
> How can I get rid of this performance issue?