(Dropping user@spark, keeping only dev@.)

This would be great to figure out, if you have time. Two things worth
trying:

1. See how this works on Spark 2.0.

2. If it is slow, try the following:

import org.apache.spark.sql.catalyst.rules.RuleExecutor

RuleExecutor.resetTime()        // clear the accumulated per-rule timings

// run your query

RuleExecutor.dumpTimeSpent()    // report how much time each rule took
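
For instance, a minimal sketch of wiring this around the query so that only
planning (not execution) is timed, assuming a SQLContext named sqlContext
and the temp table "df1" from the thread below (queryExecution is a
developer API, so this is diagnostic-only):

import org.apache.spark.sql.catalyst.rules.RuleExecutor

RuleExecutor.resetTime()
val df2 = sqlContext.sql("select col1, col2, ... from df1 group by col1, col2")
df2.queryExecution.optimizedPlan        // forces analysis + optimization, but no execution
println(RuleExecutor.dumpTimeSpent())   // per-rule time accumulated during planning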


And report back where the time is spent, if possible. Thanks!



On Thu, Jun 30, 2016 at 2:53 PM, Darshan Singh <darshan.m...@gmail.com>
wrote:

> I am using 1.5.2.
>
> I have a data-frame with 10 columns, and then I pivot one column to
> generate 700 columns.
>
> It looks like this:
>
> val df1 = sqlContext.read.parquet("file1")
> df1.registerTempTable("df1")
> val df2 = sqlContext.sql("select col1, col2, sum(case when col3 = 1 then
> col4 else 0.0 end) as col4_1, ...., sum(case when col3 = 700 then col4 else
> 0.0 end) as col4_700 from df1 group by col1, col2")
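>
> (An aside: the same 700 aggregate expressions could be built
> programmatically with the DataFrame API instead of a hand-assembled SQL
> string; a minimal sketch, assuming the column names above:)
>
> import org.apache.spark.sql.functions.{col, sum, when}
>
> val aggs = (1 to 700).map { i =>
>   sum(when(col("col3") === i, col("col4")).otherwise(0.0)).as(s"col4_$i")
> }
> val df2 = df1.groupBy("col1", "col2").agg(aggs.head, aggs.tail: _*)
>
> (The plan still contains 700 aggregate expressions, so the analysis cost
> is likely similar; this is a construction convenience, not a fix.)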
>
> Now this last statement takes around 20-30 seconds. I run this a number of
> times; the only difference is that the file behind df1 is different.
> Everything else is the same.
>
> The actual execution takes 2-3 seconds, so it is a bit frustrating that
> just generating the plan for df2 takes so much time. Worse, this runs on
> the driver, so it is not parallelized.
>
> I have a similar issue in another query where we generate more columns
> from these 700 by adding or subtracting them, and it again takes a lot of
> time.
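>
> (Purely illustrative, with hypothetical column names, that derivation
> looks something like this; each derived column adds another expression for
> the analyzer to resolve:)
>
> import org.apache.spark.sql.functions.col
>
> val df3 = df2.withColumn("net_1", col("col4_1") - col("col4_2"))
>              .withColumn("gross_1", col("col4_1") + col("col4_2"))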
>
> Not sure what could be done here.
>
> Thanks
>
> On Thu, Jun 30, 2016 at 10:10 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> Which version are you using here? If the underlying files change,
>> technically we should go through optimization again.
>>
>> Perhaps the real "fix" is to figure out why logical plan creation is so
>> slow for 700 columns.
>>
>>
>> On Thu, Jun 30, 2016 at 1:58 PM, Darshan Singh <darshan.m...@gmail.com>
>> wrote:
>>
>>> Is there a way I can reuse the same logical plan for a query? Everything
>>> will be the same except that the underlying file will be different.
>>>
>>> The issue is that my query has around 700 columns, and generating the
>>> logical plan takes 20 seconds. This happens every 2 minutes, but each
>>> time the underlying file is different.
>>>
>>> I do not know these files in advance, so I can't create the table at the
>>> directory level. These files are created and then used in the final query.
>>>
>>> Thanks
>>>
>>
>>
>
