drop user@spark and keep only dev@

This is something great to figure out, if you have time. Two things that
would be great to try:
1. See how this works on Spark 2.0.

2. If it is slow, try the following:

   org.apache.spark.sql.catalyst.rules.RuleExecutor.resetTime()
   // run your query
   org.apache.spark.sql.catalyst.rules.RuleExecutor.dumpTimeSpent()

   and report back where the time is spent, if possible.

Thanks!

On Thu, Jun 30, 2016 at 2:53 PM, Darshan Singh <darshan.m...@gmail.com> wrote:

> I am using 1.5.2.
>
> I have a data-frame with 10 columns, and then I pivot 1 column and
> generate 700 columns. It looks like this:
>
> val df1 = sqlContext.read.parquet("file1")
> df1.registerTempTable("df1")
> val df2 = sqlContext.sql("select col1, col2, sum(case when col3 = 1 then
> col4 else 0.0 end) as col4_1, ...., sum(case when col3 = 700 then col4
> else 0.0 end) as col4_700 from df1 group by col1, col2")
>
> Now this last statement takes around 20-30 seconds. I run this a number
> of times; the only difference is that the file for df1 is different each
> time. Everything else is the same.
>
> The actual statement takes 2-3 seconds, so it is a bit frustrating that
> just generating the plan for df2 takes so much time. Worse, this runs on
> the driver, so it is not parallelized.
>
> I have a similar issue in another query where, from these 700 columns,
> we generate more columns by adding or subtracting them, and it again
> takes a lot of time.
>
> Not sure what could be done here.
>
> Thanks
>
> On Thu, Jun 30, 2016 at 10:10 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> Which version are you using here? If the underlying files change,
>> technically we should go through optimization again.
>>
>> Perhaps the real "fix" is to figure out why logical plan creation is so
>> slow for 700 columns.
>>
>> On Thu, Jun 30, 2016 at 1:58 PM, Darshan Singh <darshan.m...@gmail.com>
>> wrote:
>>
>>> Is there a way I can use the same logical plan for a query? Everything
>>> will be the same except the underlying file will be different.
>>>
>>> The issue is that my query has around 700 columns, and generating the
>>> logical plan takes 20 seconds. This happens every 2 minutes, but each
>>> time the underlying file is different.
>>>
>>> I do not know these files in advance, so I can't create the table at
>>> the directory level. These files are created and then used in the
>>> final query.
>>>
>>> Thanks
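[Editor's note: not part of the thread. As a side point, the 700 `sum(case when ...)` expressions in the query above need not be hand-written; a minimal pure-Scala sketch of generating that SQL string programmatically is below. The column names (col1-col4), alias pattern, and bucket count mirror the query in the thread; generating the string does not by itself speed up plan analysis, it only avoids maintaining 700 expressions by hand.]

```scala
// Build the n pivot expressions of the form:
//   sum(case when col3 = i then col4 else 0.0 end) as col4_i
def pivotExprs(n: Int): Seq[String] =
  (1 to n).map { i =>
    s"sum(case when col3 = $i then col4 else 0.0 end) as col4_$i"
  }

// Assemble the full query, grouping by col1 and col2 as in the thread.
def pivotQuery(n: Int): String =
  s"select col1, col2, ${pivotExprs(n).mkString(", ")} from df1 group by col1, col2"
```

The resulting string can be passed to `sqlContext.sql(...)` exactly as in the thread; varying `n` also makes it easy to measure how plan-creation time grows with the column count.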