[jira] [Created] (SPARK-23012) Support for predicate pushdown and partition pruning when left joining large Hive tables

Rick Kramer (JIRA) Tue, 09 Jan 2018 13:00:15 -0800

Rick Kramer created SPARK-23012:
-----------------------------------

             Summary: Support for predicate pushdown and partition pruning when 
left joining large Hive tables
                 Key: SPARK-23012
                 URL: https://issues.apache.org/jira/browse/SPARK-23012
             Project: Spark
          Issue Type: Improvement
          Components: Optimizer
    Affects Versions: 2.2.0
            Reporter: Rick Kramer



We have a hive view which left outer joins several large, partitioned orc hive 
tables together on date. When the view is used in a hive query, hive pushes 
date predicates down into the joins and prunes the partitions for all tables. 
When I use this view from pyspark, the predicate is only used to prune the 
left-most table and all partitions from the additional tables are selected.

For example, consider two partitioned hive tables a & b joined in a view:

create table a (
   a_val string
)
partitioned by (ds string)
stored as orc;

create table b (
   b_val string
)
partitioned by (ds string)
stored as orc;

create view example_view as
select
    a_val
    , b_val
    , ds
from a 
left outer join b on b.ds = a.ds

Then in pyspark you might try to query from the view filtering on ds:

spark.table('example_view').filter(F.col('ds') == '2018-01-01')

If table a and b are large, this results in a plan that filters a on ds = 
2018-01-01, but selects scans all partitions of table b.

If the join in the view is changed to an inner join, the predicate gets pushed 
down to a & b and the partitions are pruned as you'd expect.

In practice, the view is fairly complex and contains a lot of business logic 
we'd prefer not to replicate in pyspark if we can avoid it.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Created] (SPARK-23012) Support for predicate pushdown and partition pruning when left joining large Hive tables

Reply via email to