Re: Spark SQL Optimization
Hi all,

What are the sizes of the three tables? Also, what is the performance
difference between the two queries?

On Tue, Mar 22, 2016 at 3:53 PM, Rishi Mishra <rmis...@snappydata.io> wrote:
> What we have observed so far is that Spark picks the join order in the
> same order as the tables are specified in the FROM clause. Sometimes
> reordering benefits the join query. This could be a built-in
> optimization in Spark, but it is not going to be straightforward:
> rather than table size, the selectivity of the join is what matters.
> Probably some kind of heuristic might help.
>
> Regards,
> Rishitesh Mishra,
> SnappyData (http://www.snappydata.io/)
> https://in.linkedin.com/in/rishiteshmishra

--
---
Takeshi Yamamuro
Re: Spark SQL Optimization
What we have observed so far is that Spark picks the join order in the
same order as the tables are specified in the FROM clause. Sometimes
reordering benefits the join query. This could be a built-in
optimization in Spark, but it is not going to be straightforward:
rather than table size, the selectivity of the join is what matters.
Probably some kind of heuristic might help.

Regards,
Rishitesh Mishra,
SnappyData (http://www.snappydata.io/)
https://in.linkedin.com/in/rishiteshmishra

On Mon, Mar 21, 2016 at 11:18 PM, gtinside <gtins...@gmail.com> wrote:
> More details :
>
> Original query:
>
> select distinct pge.portfolio_code
> from table1 pge join table2 p
>   on p.perm_group = pge.anc_port_group
> join table3 uge
>   on p.user_group = uge.anc_user_group
> where uge.user_name = 'user' and p.perm_type = 'TEST'
>
> Optimized query:
>
> select distinct pge.portfolio_code
> from table1 uge, table2 p, table3 pge
> where uge.user_name = 'user' and p.perm_type = 'TEST'
>   and p.perm_group = pge.anc_port_group
>   and p.user_group = uge.anc_user_group
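Rishi's point, that join selectivity rather than raw table size should drive any reordering heuristic, can be sketched as a toy cost model. This is not Spark code; the table names echo the thread, but every row count and selectivity below is invented for illustration:

```python
from itertools import permutations

# Hypothetical row counts per table; these numbers are made up for
# illustration and are not taken from the thread.
rows = {"table1": 1_000_000, "table2": 50_000, "table3": 100}

# Hypothetical join selectivities: the fraction of the cross product that
# survives the join predicate between two tables (1.0 = no join edge).
selectivity = {
    frozenset(["table1", "table2"]): 1e-6,  # p.perm_group = pge.anc_port_group
    frozenset(["table2", "table3"]): 1e-5,  # p.user_group = uge.anc_user_group
}

def cost(order):
    """Sum of intermediate result sizes for a left-deep join in this order."""
    inter = rows[order[0]]
    joined = {order[0]}
    total = 0
    for t in order[1:]:
        # Use the most selective join edge to any already-joined table;
        # a crude stand-in for a real cardinality estimator.
        sel = min(selectivity.get(frozenset([j, t]), 1.0) for j in joined)
        inter = inter * rows[t] * sel
        total += inter
        joined.add(t)
    return total

best = min(permutations(rows), key=cost)
print("cheapest order:", best)
```

Under these made-up numbers the cheapest order starts with a small, selective table rather than the large one, which is exactly the behavior the manual FROM-clause rewrite in this thread achieves by hand.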
Re: Spark SQL Optimization
More details :

Execution plan for the original query:

select distinct pge.portfolio_code
from table1 pge join table2 p
  on p.perm_group = pge.anc_port_group
join table3 uge
  on p.user_group = uge.anc_user_group
where uge.user_name = 'user' and p.perm_type = 'TEST'

== Physical Plan ==
TungstenAggregate(key=[portfolio_code#14119], functions=[], output=[portfolio_code#14119])
 TungstenExchange hashpartitioning(portfolio_code#14119)
  TungstenAggregate(key=[portfolio_code#14119], functions=[], output=[portfolio_code#14119])
   TungstenProject [portfolio_code#14119]
    BroadcastHashJoin [user_group#13665], [anc_user_group#13658], BuildRight
     TungstenProject [portfolio_code#14119,user_group#13665]
      BroadcastHashJoin [anc_port_group#14117], [perm_group#13667], BuildRight
       ConvertToUnsafe
        Scan ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table1.parquet][portfolio_code#14119,anc_port_group#14117]
       ConvertToUnsafe
        Project [user_group#13665,perm_group#13667]
         Filter (perm_type#13666 = TEST)
          Scan ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table2.parquet][user_group#13665,perm_group#13667,perm_type#13666]
      ConvertToUnsafe
       Project [anc_user_group#13658]
        Filter (user_name#13659 = user)
         Scan ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table3.parquet][anc_user_group#13658,user_name#13659]

Execution plan for the optimized query:

select distinct pge.portfolio_code
from table1 uge, table2 p, table3 pge
where uge.user_name = 'user' and p.perm_type = 'TEST'
  and p.perm_group = pge.anc_port_group
  and p.user_group = uge.anc_user_group

== Physical Plan ==
TungstenAggregate(key=[portfolio_code#14119], functions=[], output=[portfolio_code#14119])
 TungstenExchange hashpartitioning(portfolio_code#14119)
  TungstenAggregate(key=[portfolio_code#14119], functions=[], output=[portfolio_code#14119])
   TungstenProject [portfolio_code#14119]
    BroadcastHashJoin [perm_group#13667], [anc_port_group#14117], BuildRight
     TungstenProject [perm_group#13667]
      BroadcastHashJoin [anc_user_group#13658], [user_group#13665], BuildRight
       ConvertToUnsafe
        Project [anc_user_group#13658]
         Filter (user_name#13659 = user)
          Scan ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table3.parquet][anc_user_group#13658,user_name#13659]
       ConvertToUnsafe
        Project [perm_group#13667,user_group#13665]
         Filter (perm_type#13666 = TEST)
          Scan ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table2.parquet][perm_group#13667,user_group#13665,perm_type#13666]
      ConvertToUnsafe
       Scan ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table1.parquet][portfolio_code#14119,anc_port_group#14117]

--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Optimization-tp26548p26553.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark SQL Optimization
Hi,

Maybe you can open a JIRA and upload your plan, as Michael suggested.
This is an interesting feature. Thanks!

Xiao Li

2016-03-21 10:36 GMT-07:00 Michael Armbrust <mich...@databricks.com>:
> It's helpful if you can include the output of EXPLAIN EXTENDED or
> df.explain(true) whenever asking about query performance.
Re: Spark SQL Optimization
It's helpful if you can include the output of EXPLAIN EXTENDED or
df.explain(true) whenever asking about query performance.

On Mon, Mar 21, 2016 at 6:27 AM, gtinside <gtins...@gmail.com> wrote:
> Hi,
>
> I am trying to execute a simple query with a join on 3 tables. When I
> look at the execution plan, it varies with the position of the tables
> in the FROM clause. The execution plan looks more optimized when the
> table with predicates is specified before the other tables.
>
> Original query:
>
> select distinct pge.portfolio_code
> from table1 pge join table2 p
>   on p.perm_group = pge.anc_port_group
> join table3 uge
>   on p.user_group = uge.anc_user_group
> where uge.user_name = 'user' and p.perm_type = 'TEST'
>
> Optimized query (table with predicates is moved ahead):
>
> select distinct pge.portfolio_code
> from table1 uge, table2 p, table3 pge
> where uge.user_name = 'user' and p.perm_type = 'TEST'
>   and p.perm_group = pge.anc_port_group
>   and p.user_group = uge.anc_user_group
>
> The execution plan is more optimized for the second query, and hence
> it executes faster. All the tables are sourced from Parquet files.
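The habit Michael recommends, inspecting the plan before asking why a query is slow, can be tried in a fully self-contained way. The sketch below uses SQLite from the Python standard library rather than Spark (the schemas mirror the thread's tables, but SQLite's planner and output format are entirely different from Spark's):

```python
import sqlite3

# In-memory database with the same three-table join shape as the thread.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t1 (portfolio_code TEXT, anc_port_group TEXT);
CREATE TABLE t2 (perm_group TEXT, user_group TEXT, perm_type TEXT);
CREATE TABLE t3 (anc_user_group TEXT, user_name TEXT);
""")

# EXPLAIN QUERY PLAN is SQLite's (much terser) analogue of Spark's
# EXPLAIN EXTENDED / df.explain(true): it shows scan and join strategy.
plan = con.execute("""
EXPLAIN QUERY PLAN
SELECT DISTINCT t1.portfolio_code
FROM t1 JOIN t2 ON t2.perm_group = t1.anc_port_group
        JOIN t3 ON t2.user_group = t3.anc_user_group
WHERE t3.user_name = 'user' AND t2.perm_type = 'TEST'
""").fetchall()

for row in plan:
    print(row)
```

The point is the workflow, not SQLite itself: capture the plan text alongside the query, and a reviewer can see which table is scanned first and how each join is executed.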
Spark SQL Optimization
Hi,

I am trying to execute a simple query with a join on 3 tables. When I
look at the execution plan, it varies with the position of the tables in
the FROM clause. The execution plan looks more optimized when the table
with predicates is specified before the other tables.

Original query:

select distinct pge.portfolio_code
from table1 pge join table2 p
  on p.perm_group = pge.anc_port_group
join table3 uge
  on p.user_group = uge.anc_user_group
where uge.user_name = 'user' and p.perm_type = 'TEST'

Optimized query (table with predicates is moved ahead):

select distinct pge.portfolio_code
from table1 uge, table2 p, table3 pge
where uge.user_name = 'user' and p.perm_type = 'TEST'
  and p.perm_group = pge.anc_port_group
  and p.user_group = uge.anc_user_group

The execution plan is more optimized for the second query, and hence the
query executes faster. All the tables are sourced from Parquet files.

--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Optimization-tp26548.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
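The effect described in this thread can be illustrated outside Spark with a toy in-memory hash join. All table contents and sizes below are invented, and Spark's actual execution differs; the sketch only mirrors the relative amount of intermediate work in the two query shapes:

```python
# Toy versions of the three tables, with made-up data, to show why
# applying filters before joining shrinks the intermediate results.
table1 = [{"portfolio_code": f"P{i}", "anc_port_group": f"G{i % 100}"}
          for i in range(10_000)]
table2 = [{"perm_group": f"G{i % 100}", "user_group": f"U{i % 50}",
           "perm_type": "TEST" if i % 2 else "PROD"} for i in range(1_000)]
table3 = [{"anc_user_group": f"U{i}",
           "user_name": "user" if i == 7 else f"u{i}"} for i in range(50)]

def hash_join(left, right, lkey, rkey):
    """Minimal hash join: build an index on the right side, probe with left."""
    index = {}
    for r in right:
        index.setdefault(r[rkey], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[lkey], [])]

# Shape of the original query: join the big table first, filter late.
j1 = [r for r in hash_join(table1, table2, "anc_port_group", "perm_group")
      if r["perm_type"] == "TEST"]
res_a = {r["portfolio_code"]
         for r in hash_join(j1, table3, "user_group", "anc_user_group")
         if r["user_name"] == "user"}

# Shape of the rewritten query: filter the small tables first, then join.
f2 = [r for r in table2 if r["perm_type"] == "TEST"]
f3 = [r for r in table3 if r["user_name"] == "user"]
k1 = hash_join(f2, f3, "user_group", "anc_user_group")
res_b = {r["portfolio_code"]
         for r in hash_join(k1, table1, "perm_group", "anc_port_group")}

print(len(j1), "vs", len(k1), "intermediate rows; same answer:", res_a == res_b)
```

Both orders return the same portfolio codes, but the filtered-first order carries orders of magnitude fewer intermediate rows, which is the same reason the rewritten FROM clause in this thread runs faster.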