[
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
liyunzhang_intel updated HIVE-11297:
------------------------------------
Attachment: HIVE-11297.1.patch
[~csun]: updated the patch. In my environment, the [case "multiple sources, single
key"|https://issues.apache.org/jira/browse/HIVE-16780] in
spark_dynamic_pruning.q fails, so I could not regenerate
spark_dynamic_partition_pruning.q.out. I extracted the test case for "multiple
columns, single source" into a new qfile,
"spark_dynamic_partition_pruning_combine.q". Here I added a configuration item,
"hive.spark.dynamic.partition.pruning.combine": when this config item is not
enabled, combining op trees for partition info will not happen.
{code}
set hive.optimize.ppd=true;
set hive.ppd.remove.duplicatefilters=true;
set hive.spark.dynamic.partition.pruning=true;
set hive.optimize.metadataonly=false;
set hive.optimize.index.filter=true;
set hive.strict.checks.cartesian.product=false;
set hive.spark.dynamic.partition.pruning=true;
set hive.spark.dynamic.partition.pruning.combine=true;
-- SORT_QUERY_RESULTS
create table srcpart_date_hour as select ds as ds, ds as `date`, hr as hr, hr
as hour from srcpart group by ds, hr;
-- multiple columns single source
EXPLAIN select count(*) from srcpart join srcpart_date_hour on (srcpart.ds =
srcpart_date_hour.ds and srcpart.hr = srcpart_date_hour.hr) where
srcpart_date_hour.`date` = '2008-04-08' and srcpart_date_hour.hour = 11;
select count(*) from srcpart join srcpart_date_hour on (srcpart.ds =
srcpart_date_hour.ds and srcpart.hr = srcpart_date_hour.hr) where
srcpart_date_hour.`date` = '2008-04-08' and srcpart_date_hour.hour = 11;
set hive.spark.dynamic.partition.pruning.combine=false;
EXPLAIN select count(*) from srcpart join srcpart_date_hour on (srcpart.ds =
srcpart_date_hour.ds and srcpart.hr = srcpart_date_hour.hr) where
srcpart_date_hour.`date` = '2008-04-08' and srcpart_date_hour.hour = 11;
select count(*) from srcpart join srcpart_date_hour on (srcpart.ds =
srcpart_date_hour.ds and srcpart.hr = srcpart_date_hour.hr) where
srcpart_date_hour.`date` = '2008-04-08' and srcpart_date_hour.hour = 11;
{code}
I think we can work in parallel: you can review this patch while I continue
fixing HIVE-16780. After HIVE-16780 is fixed in my environment, I can update
spark_dynamic_partition_pruning.q.out with the changes from HIVE-11297.
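To illustrate the combine idea in miniature (a hedged sketch with hypothetical names, not Hive's actual operator classes): when a small table produces pruning sinks for several partition columns, the sinks can be grouped by their shared table scan so the scan runs once instead of once per sink.

```java
import java.util.*;

// Minimal model of the optimization: group partition-pruning sinks by the
// table scan they originate from. Names like TS_/SINK_ are illustrative only.
public class CombineSinks {
    // Each entry is {tableScanId, pruningSinkId}; the result maps one scan
    // to all sinks it feeds, so the scan is traversed a single time.
    static Map<String, List<String>> combine(List<String[]> sinks) {
        Map<String, List<String>> byScan = new LinkedHashMap<>();
        for (String[] s : sinks) {
            byScan.computeIfAbsent(s[0], k -> new ArrayList<>()).add(s[1]);
        }
        return byScan;
    }

    public static void main(String[] args) {
        // Two pruning sinks (for ds and hr) sharing one scan of the small table
        List<String[]> sinks = Arrays.asList(
            new String[]{"TS_srcpart_date_hour", "SINK_ds"},
            new String[]{"TS_srcpart_date_hour", "SINK_hr"});
        Map<String, List<String>> combined = combine(sinks);
        // After combining: a single scan feeds both sinks
        System.out.println(combined.size());
        System.out.println(combined.get("TS_srcpart_date_hour").size());
    }
}
```

With hive.spark.dynamic.partition.pruning.combine=false the analogous grouping step is skipped, leaving one op tree (and one scan) per sink.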
> Combine op trees for partition info generating tasks [Spark branch]
> -------------------------------------------------------------------
>
> Key: HIVE-11297
> URL: https://issues.apache.org/jira/browse/HIVE-11297
> Project: Hive
> Issue Type: Bug
> Affects Versions: spark-branch
> Reporter: Chao Sun
> Assignee: liyunzhang_intel
> Attachments: HIVE-11297.1.patch
>
>
> Currently, for dynamic partition pruning in Spark, if a small table generates
> partition info for more than one partition column, multiple operator trees
> are created, which all start from the same table scan op but have different
> spark partition pruning sinks.
> As an optimization, we can combine these op trees so we don't have to scan
> the table multiple times.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)