Re: Spark SQL Optimization

2016-03-23 Thread Takeshi Yamamuro
Hi, all

What are the sizes of the three tables? Also, what is the performance
difference between the two queries?

On Tue, Mar 22, 2016 at 3:53 PM, Rishi Mishra <rmis...@snappydata.io> wrote:

> What we have observed so far is that Spark picks the join order in the same
> order as the tables are listed in the from clause. Sometimes reordering
> benefits the join query.
> This could become a built-in optimization in Spark, but it is not going to be
> straightforward, because join selectivity matters more than table size.
> Some kind of heuristic would probably help.
>
> Regards,
> Rishitesh Mishra,
> SnappyData . (http://www.snappydata.io/)
>
> https://in.linkedin.com/in/rishiteshmishra
>
> On Mon, Mar 21, 2016 at 11:18 PM, gtinside <gtins...@gmail.com> wrote:
>
>> More details:
>>
>> Execution plan for Original query
>> select distinct pge.portfolio_code
>> from table1 pge join table2 p
>> on p.perm_group = pge.anc_port_group
>> join table3 uge
>> on p.user_group=uge.anc_user_group
>> where uge.user_name = 'user' and p.perm_type = 'TEST'
>>
>> == Physical Plan ==
>> TungstenAggregate(key=[portfolio_code#14119], functions=[], output=[portfolio_code#14119])
>>  TungstenExchange hashpartitioning(portfolio_code#14119)
>>   TungstenAggregate(key=[portfolio_code#14119], functions=[], output=[portfolio_code#14119])
>>    TungstenProject [portfolio_code#14119]
>>     BroadcastHashJoin [user_group#13665], [anc_user_group#13658], BuildRight
>>      TungstenProject [portfolio_code#14119,user_group#13665]
>>       BroadcastHashJoin [anc_port_group#14117], [perm_group#13667], BuildRight
>>        ConvertToUnsafe
>>         Scan ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table1.parquet][portfolio_code#14119,anc_port_group#14117]
>>        ConvertToUnsafe
>>         Project [user_group#13665,perm_group#13667]
>>          Filter (perm_type#13666 = TEST)
>>           Scan ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table2.parquet][user_group#13665,perm_group#13667,perm_type#13666]
>>      ConvertToUnsafe
>>       Project [anc_user_group#13658]
>>        Filter (user_name#13659 = user)
>>         Scan ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table3.parquet][anc_user_group#13658,user_name#13659]
>>
>>
>>
>> Execution plan for optimized query
>> select distinct pge.portfolio_code
>> from table1 uge, table2 p, table3 pge
>> where uge.user_name = 'user' and p.perm_type = 'TEST'
>> and p.perm_group = pge.anc_port_group
>> and p.user_group=uge.anc_user_group
>>
>> == Physical Plan ==
>> TungstenAggregate(key=[portfolio_code#14119], functions=[], output=[portfolio_code#14119])
>>  TungstenExchange hashpartitioning(portfolio_code#14119)
>>   TungstenAggregate(key=[portfolio_code#14119], functions=[], output=[portfolio_code#14119])
>>    TungstenProject [portfolio_code#14119]
>>     BroadcastHashJoin [perm_group#13667], [anc_port_group#14117], BuildRight
>>      TungstenProject [perm_group#13667]
>>       BroadcastHashJoin [anc_user_group#13658], [user_group#13665], BuildRight
>>        ConvertToUnsafe
>>         Project [anc_user_group#13658]
>>          Filter (user_name#13659 = user)
>>           Scan ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table3.parquet][anc_user_group#13658,user_name#13659]
>>        ConvertToUnsafe
>>         Project [perm_group#13667,user_group#13665]
>>          Filter (perm_type#13666 = TEST)
>>           Scan ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table2.parquet][perm_group#13667,user_group#13665,perm_type#13666]
>>      ConvertToUnsafe
>>       Scan ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table1.parquet][portfolio_code#14119,anc_port_group#14117]
>>
>>
>


-- 
---
Takeshi Yamamuro


Re: Spark SQL Optimization

2016-03-22 Thread Rishi Mishra
What we have observed so far is that Spark picks the join order in the same
order as the tables are listed in the from clause. Sometimes reordering
benefits the join query.
This could become a built-in optimization in Spark, but it is not going to be
straightforward, because join selectivity matters more than table size.
Some kind of heuristic would probably help.
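
Until something like that lands, the reordering can also be done by hand on the
DataFrame side. A rough Scala sketch of the idea for the query below, meant for
the spark-shell (the Parquet paths and column names are simply the ones that
appear in the quoted plans; the broadcast() hint assumes the filtered tables are
small):

import org.apache.spark.sql.functions.broadcast
import sqlContext.implicits._

// Paths as they appear in the quoted plans below.
val base = "snackfs://shared:9042/aladdin_data_beta"

// Apply the selective filters first, so only small inputs feed the joins.
val uge = sqlContext.read.parquet(s"$base/table3.parquet").filter($"user_name" === "user")
val p   = sqlContext.read.parquet(s"$base/table2.parquet").filter($"perm_type" === "TEST")
val pge = sqlContext.read.parquet(s"$base/table1.parquet")

// Join outward from the filtered tables; broadcast() hints that they are
// small enough to be shipped to every executor.
val result = broadcast(uge)
  .join(p, p("user_group") === uge("anc_user_group"))
  .join(pge, p("perm_group") === pge("anc_port_group"))
  .select(pge("portfolio_code"))
  .distinct()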

Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)

https://in.linkedin.com/in/rishiteshmishra

On Mon, Mar 21, 2016 at 11:18 PM, gtinside <gtins...@gmail.com> wrote:

> More details:
>
> Execution plan for Original query
> select distinct pge.portfolio_code
> from table1 pge join table2 p
> on p.perm_group = pge.anc_port_group
> join table3 uge
> on p.user_group=uge.anc_user_group
> where uge.user_name = 'user' and p.perm_type = 'TEST'
>
> == Physical Plan ==
> TungstenAggregate(key=[portfolio_code#14119], functions=[], output=[portfolio_code#14119])
>  TungstenExchange hashpartitioning(portfolio_code#14119)
>   TungstenAggregate(key=[portfolio_code#14119], functions=[], output=[portfolio_code#14119])
>    TungstenProject [portfolio_code#14119]
>     BroadcastHashJoin [user_group#13665], [anc_user_group#13658], BuildRight
>      TungstenProject [portfolio_code#14119,user_group#13665]
>       BroadcastHashJoin [anc_port_group#14117], [perm_group#13667], BuildRight
>        ConvertToUnsafe
>         Scan ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table1.parquet][portfolio_code#14119,anc_port_group#14117]
>        ConvertToUnsafe
>         Project [user_group#13665,perm_group#13667]
>          Filter (perm_type#13666 = TEST)
>           Scan ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table2.parquet][user_group#13665,perm_group#13667,perm_type#13666]
>      ConvertToUnsafe
>       Project [anc_user_group#13658]
>        Filter (user_name#13659 = user)
>         Scan ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table3.parquet][anc_user_group#13658,user_name#13659]
>
>
>
> Execution plan for optimized query
> select distinct pge.portfolio_code
> from table1 uge, table2 p, table3 pge
> where uge.user_name = 'user' and p.perm_type = 'TEST'
> and p.perm_group = pge.anc_port_group
> and p.user_group=uge.anc_user_group
>
> == Physical Plan ==
> TungstenAggregate(key=[portfolio_code#14119], functions=[], output=[portfolio_code#14119])
>  TungstenExchange hashpartitioning(portfolio_code#14119)
>   TungstenAggregate(key=[portfolio_code#14119], functions=[], output=[portfolio_code#14119])
>    TungstenProject [portfolio_code#14119]
>     BroadcastHashJoin [perm_group#13667], [anc_port_group#14117], BuildRight
>      TungstenProject [perm_group#13667]
>       BroadcastHashJoin [anc_user_group#13658], [user_group#13665], BuildRight
>        ConvertToUnsafe
>         Project [anc_user_group#13658]
>          Filter (user_name#13659 = user)
>           Scan ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table3.parquet][anc_user_group#13658,user_name#13659]
>        ConvertToUnsafe
>         Project [perm_group#13667,user_group#13665]
>          Filter (perm_type#13666 = TEST)
>           Scan ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table2.parquet][perm_group#13667,user_group#13665,perm_type#13666]
>      ConvertToUnsafe
>       Scan ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table1.parquet][portfolio_code#14119,anc_port_group#14117]
>


Re: Spark SQL Optimization

2016-03-21 Thread gtinside
More details:

Execution plan for Original query 
select distinct pge.portfolio_code 
from table1 pge join table2 p 
on p.perm_group = pge.anc_port_group 
join table3 uge 
on p.user_group=uge.anc_user_group 
where uge.user_name = 'user' and p.perm_type = 'TEST' 

== Physical Plan ==
TungstenAggregate(key=[portfolio_code#14119], functions=[], output=[portfolio_code#14119])
 TungstenExchange hashpartitioning(portfolio_code#14119)
  TungstenAggregate(key=[portfolio_code#14119], functions=[], output=[portfolio_code#14119])
   TungstenProject [portfolio_code#14119]
    BroadcastHashJoin [user_group#13665], [anc_user_group#13658], BuildRight
     TungstenProject [portfolio_code#14119,user_group#13665]
      BroadcastHashJoin [anc_port_group#14117], [perm_group#13667], BuildRight
       ConvertToUnsafe
        Scan ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table1.parquet][portfolio_code#14119,anc_port_group#14117]
       ConvertToUnsafe
        Project [user_group#13665,perm_group#13667]
         Filter (perm_type#13666 = TEST)
          Scan ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table2.parquet][user_group#13665,perm_group#13667,perm_type#13666]
     ConvertToUnsafe
      Project [anc_user_group#13658]
       Filter (user_name#13659 = user)
        Scan ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table3.parquet][anc_user_group#13658,user_name#13659]



Execution plan for optimized query 
select distinct pge.portfolio_code 
from table1 uge, table2 p, table3 pge 
where uge.user_name = 'user' and p.perm_type = 'TEST' 
and p.perm_group = pge.anc_port_group 
and p.user_group=uge.anc_user_group 

== Physical Plan ==
TungstenAggregate(key=[portfolio_code#14119], functions=[], output=[portfolio_code#14119])
 TungstenExchange hashpartitioning(portfolio_code#14119)
  TungstenAggregate(key=[portfolio_code#14119], functions=[], output=[portfolio_code#14119])
   TungstenProject [portfolio_code#14119]
    BroadcastHashJoin [perm_group#13667], [anc_port_group#14117], BuildRight
     TungstenProject [perm_group#13667]
      BroadcastHashJoin [anc_user_group#13658], [user_group#13665], BuildRight
       ConvertToUnsafe
        Project [anc_user_group#13658]
         Filter (user_name#13659 = user)
          Scan ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table3.parquet][anc_user_group#13658,user_name#13659]
       ConvertToUnsafe
        Project [perm_group#13667,user_group#13665]
         Filter (perm_type#13666 = TEST)
          Scan ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table2.parquet][perm_group#13667,user_group#13665,perm_type#13666]
     ConvertToUnsafe
      Scan ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table1.parquet][portfolio_code#14119,anc_port_group#14117]
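
Both plans already use BroadcastHashJoin for each join; the main difference is
which side gets built and where the filters are applied. While comparing the
two, one knob worth checking in the spark-shell is the broadcast threshold
(a small sketch; the 100 MB value below is only an example):

// Estimated relations smaller than this threshold are broadcast in a
// BroadcastHashJoin; the default is 10485760 bytes (10 MB).
println(sqlContext.getConf("spark.sql.autoBroadcastJoinThreshold", "10485760"))

// Raise it (value in bytes) if the filtered dimension tables are slightly
// larger than the default, e.g. to 100 MB.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)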















Re: Spark SQL Optimization

2016-03-21 Thread Xiao Li
Hi,

Maybe you can open a JIRA and upload your plans, as Michael suggested.
This is an interesting feature. Thanks!

Xiao Li

2016-03-21 10:36 GMT-07:00 Michael Armbrust <mich...@databricks.com>:

> It's helpful if you can include the output of EXPLAIN EXTENDED or
> df.explain(true) whenever asking about query performance.
>
> On Mon, Mar 21, 2016 at 6:27 AM, gtinside <gtins...@gmail.com> wrote:
>
>> Hi,
>>
>> I am trying to execute a simple query with a join on 3 tables. When I look
>> at the execution plan, it varies with the position of the tables in the
>> "from" clause. The execution plan looks more optimized when the tables
>> carrying the predicates are listed before any other table.
>>
>>
>> Original query:
>>
>> select distinct pge.portfolio_code
>> from table1 pge join table2 p
>> on p.perm_group = pge.anc_port_group
>> join table3 uge
>> on p.user_group=uge.anc_user_group
>> where uge.user_name = 'user' and p.perm_type = 'TEST'
>>
>> Optimized query (table with predicates is moved ahead):
>>
>> select distinct pge.portfolio_code
>> from table1 uge, table2 p, table3 pge
>> where uge.user_name = 'user' and p.perm_type = 'TEST'
>> and p.perm_group = pge.anc_port_group
>> and p.user_group=uge.anc_user_group
>>
>>
>> The execution plan is better for the optimized query, and hence that query
>> executes faster. All the tables are sourced from Parquet files.
>>
>


Re: Spark SQL Optimization

2016-03-21 Thread Michael Armbrust
It's helpful if you can include the output of EXPLAIN EXTENDED or
df.explain(true) whenever asking about query performance.
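
For example, in the spark-shell, a minimal sketch (it assumes table1/table2/table3
are already registered as temporary tables):

val query = """
  select distinct pge.portfolio_code
  from table1 pge
  join table2 p on p.perm_group = pge.anc_port_group
  join table3 uge on p.user_group = uge.anc_user_group
  where uge.user_name = 'user' and p.perm_type = 'TEST'
"""

// SQL route: EXPLAIN EXTENDED returns the parsed, analyzed, optimized and
// physical plans as rows of text.
sqlContext.sql(s"EXPLAIN EXTENDED $query").collect().foreach(println)

// DataFrame route: explain(true) prints the same extended output.
sqlContext.sql(query).explain(true)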

On Mon, Mar 21, 2016 at 6:27 AM, gtinside <gtins...@gmail.com> wrote:

> Hi,
>
> I am trying to execute a simple query with a join on 3 tables. When I look at
> the execution plan, it varies with the position of the tables in the "from"
> clause. The execution plan looks more optimized when the tables carrying the
> predicates are listed before any other table.
>
>
> Original query:
>
> select distinct pge.portfolio_code
> from table1 pge join table2 p
> on p.perm_group = pge.anc_port_group
> join table3 uge
> on p.user_group=uge.anc_user_group
> where uge.user_name = 'user' and p.perm_type = 'TEST'
>
> Optimized query (table with predicates is moved ahead):
>
> select distinct pge.portfolio_code
> from table1 uge, table2 p, table3 pge
> where uge.user_name = 'user' and p.perm_type = 'TEST'
> and p.perm_group = pge.anc_port_group
> and p.user_group=uge.anc_user_group
>
>
> The execution plan is better for the optimized query, and hence that query
> executes faster. All the tables are sourced from Parquet files.
>
>


Spark SQL Optimization

2016-03-21 Thread gtinside
Hi,

I am trying to execute a simple query with a join on 3 tables. When I look at
the execution plan, it varies with the position of the tables in the "from"
clause. The execution plan looks more optimized when the tables carrying the
predicates are listed before any other table.


Original query:

select distinct pge.portfolio_code 
from table1 pge join table2 p
on p.perm_group = pge.anc_port_group 
join table3 uge
on p.user_group=uge.anc_user_group
where uge.user_name = 'user' and p.perm_type = 'TEST'

Optimized query (table with predicates is moved ahead):

select distinct pge.portfolio_code 
from table1 uge, table2 p, table3 pge 
where uge.user_name = 'user' and p.perm_type = 'TEST' 
and p.perm_group = pge.anc_port_group 
and p.user_group=uge.anc_user_group


The execution plan is better for the optimized query, and hence that query
executes faster. All the tables are sourced from Parquet files.
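
To reproduce the comparison in the spark-shell, here is a rough sketch (the
Parquet paths are the ones that appear in the plans elsewhere in this thread;
adjust as needed):

// Register the three Parquet tables, then compare the two physical plans.
val base = "snackfs://shared:9042/aladdin_data_beta"
sqlContext.read.parquet(s"$base/table1.parquet").registerTempTable("table1")
sqlContext.read.parquet(s"$base/table2.parquet").registerTempTable("table2")
sqlContext.read.parquet(s"$base/table3.parquet").registerTempTable("table3")

val original = sqlContext.sql("""
  select distinct pge.portfolio_code
  from table1 pge
  join table2 p on p.perm_group = pge.anc_port_group
  join table3 uge on p.user_group = uge.anc_user_group
  where uge.user_name = 'user' and p.perm_type = 'TEST'
""")

val reordered = sqlContext.sql("""
  select distinct pge.portfolio_code
  from table1 uge, table2 p, table3 pge
  where uge.user_name = 'user' and p.perm_type = 'TEST'
  and p.perm_group = pge.anc_port_group
  and p.user_group = uge.anc_user_group
""")

original.explain(true)   // plan follows the order given in the from clause
reordered.explain(true)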


