[ 
https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866210#comment-15866210
 ] 

Nick Dimiduk commented on SPARK-13219:
--------------------------------------

bq. The result is that no matter the number of tables and join columns, the 
predicates are augmented such that = and IN on literals are pushed to all the 
joined tables.

[~velvia], [~smilegator], the above statement is an accurate reflection of this 
patch? I've upgraded to 2.1.0 in order to access this optimization. What about 
the case of a broadcast join -- will the values present in the smaller relation 
be pushed down to the scanner of the larger? I'm not seeing that behavior. 
Maybe this optimization is out of scope of this patch? I do see {{IsNotNull}} 
pushed down, applied the the columns involved in the join constraint, but I was 
expecting a but more.

Thanks again for the patch!

> Pushdown predicate propagation in SparkSQL with join
> ----------------------------------------------------
>
>                 Key: SPARK-13219
>                 URL: https://issues.apache.org/jira/browse/SPARK-13219
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.4.1, 1.5.2, 1.6.0
>         Environment: Spark 1.4
> Datastax Spark connector 1.4
> Cassandra. 2.1.12
> Centos 6.6
>            Reporter: Abhinav Chawade
>
> When 2 or more tables are joined in SparkSQL and there is an equality clause 
> in query on attributes used to perform the join, it is useful to apply that 
> clause on scans for both table. If this is not done, one of the tables 
> results in full scan which can reduce the query dramatically. Consider 
> following example with 2 tables being joined.
> {code}
> CREATE TABLE assets (
>     assetid int PRIMARY KEY,
>     address text,
>     propertyname text
> )
> CREATE TABLE tenants (
>     assetid int PRIMARY KEY,
>     name text
> )
> spark-sql> explain select t.name from tenants t, assets a where a.assetid = 
> t.assetid and t.assetid='1201';
> WARN  2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to 
> load native-hadoop library for your platform... using builtin-java classes 
> where applicable
> == Physical Plan ==
> Project [name#14]
>  ShuffledHashJoin [assetid#13], [assetid#15], BuildRight
>   Exchange (HashPartitioning 200)
>    Filter (CAST(assetid#13, DoubleType) = 1201.0)
>     HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, 
> Some(t)), None
>   Exchange (HashPartitioning 200)
>    HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), 
> None
> Time taken: 1.354 seconds, Fetched 8 row(s)
> {code}
> The simple workaround is to add another equality condition for each table but 
> it becomes cumbersome. It will be helpful if the query planner could improve 
> filter propagation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to