[
https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872811#comment-15872811
]
gagan taneja commented on SPARK-13219:
--------------------------------------
This is what we are looking for
For example
Table Address is partitioned based on postal_code
Table Location which contain location_name and potal_code
Query SELECT * FROM address JOIN location ON postal_code WHERE location_name =
'San Jose'
If the query is re-written as a in clause the optimizer will be able to prune
the partitions which would be significantly faster
> Pushdown predicate propagation in SparkSQL with join
> ----------------------------------------------------
>
> Key: SPARK-13219
> URL: https://issues.apache.org/jira/browse/SPARK-13219
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.4.1, 1.5.2, 1.6.0
> Environment: Spark 1.4
> Datastax Spark connector 1.4
> Cassandra. 2.1.12
> Centos 6.6
> Reporter: Abhinav Chawade
>
> When 2 or more tables are joined in SparkSQL and there is an equality clause
> in query on attributes used to perform the join, it is useful to apply that
> clause on scans for both table. If this is not done, one of the tables
> results in full scan which can reduce the query dramatically. Consider
> following example with 2 tables being joined.
> {code}
> CREATE TABLE assets (
> assetid int PRIMARY KEY,
> address text,
> propertyname text
> )
> CREATE TABLE tenants (
> assetid int PRIMARY KEY,
> name text
> )
> spark-sql> explain select t.name from tenants t, assets a where a.assetid =
> t.assetid and t.assetid='1201';
> WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to
> load native-hadoop library for your platform... using builtin-java classes
> where applicable
> == Physical Plan ==
> Project [name#14]
> ShuffledHashJoin [assetid#13], [assetid#15], BuildRight
> Exchange (HashPartitioning 200)
> Filter (CAST(assetid#13, DoubleType) = 1201.0)
> HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants,
> Some(t)), None
> Exchange (HashPartitioning 200)
> HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)),
> None
> Time taken: 1.354 seconds, Fetched 8 row(s)
> {code}
> The simple workaround is to add another equality condition for each table but
> it becomes cumbersome. It will be helpful if the query planner could improve
> filter propagation.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]