Rick Kramer created SPARK-23012:
-----------------------------------
Summary: Support for predicate pushdown and partition pruning when
left joining large Hive tables
Key: SPARK-23012
URL: https://issues.apache.org/jira/browse/SPARK-23012
Project: Spark
Issue Type: Improvement
Components: Optimizer
Affects Versions: 2.2.0
Reporter: Rick Kramer
We have a hive view which left outer joins several large, partitioned orc hive
tables together on date. When the view is used in a hive query, hive pushes
date predicates down into the joins and prunes the partitions for all tables.
When I use this view from pyspark, the predicate is only used to prune the
left-most table and all partitions from the additional tables are selected.
For example, consider two partitioned hive tables a & b joined in a view:
create table a (
a_val string
)
partitioned by (ds string)
stored as orc;
create table b (
b_val string
)
partitioned by (ds string)
stored as orc;
create view example_view as
select
a_val
, b_val
, ds
from a
left outer join b on b.ds = a.ds
Then in pyspark you might try to query from the view filtering on ds:
spark.table('example_view').filter(F.col('ds') == '2018-01-01')
If table a and b are large, this results in a plan that filters a on ds =
2018-01-01, but selects scans all partitions of table b.
If the join in the view is changed to an inner join, the predicate gets pushed
down to a & b and the partitions are pruned as you'd expect.
In practice, the view is fairly complex and contains a lot of business logic
we'd prefer not to replicate in pyspark if we can avoid it.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]