[
https://issues.apache.org/jira/browse/HIVE-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216249#comment-16216249
]
Rui Li commented on HIVE-17877:
-------------------------------
Uploaded a PoC patch. Here are the main changes:
# Before combining, each {{SparkPartitionPruningSinkDesc}} can target only one
column in one map work. After combining, the remaining
{{SparkPartitionPruningSinkDesc}} holds the columns and map works from the
other equivalent {{SparkPartitionPruningSinkDesc}} instances.
# Two {{SparkPartitionPruningSinkDesc}} instances are equivalent if they have
the same {{TableDesc}}.
# When we combine two equivalent works, if they contain DPP sinks, we'll merge
the DPP sinks. Suppose we merge DPP1 and DPP2, whose target map works are Map1
and Map2 respectively. First we add the target column/work of DPP2 to DPP1.
Then we update Map2 so that it knows it'll be pruned by DPP1 instead of DPP2,
i.e. we update its {{eventSource}} maps and tmp path.
# Currently {{CombineEquivalentWorkResolver}} doesn't handle leaf works. With
the patch, it'll handle leaf works if all leaf operators in the leaf works are
DPP sinks.
# Currently {{SparkPartitionPruningSinkOperator}} writes the target column name
into the output file. Since it can now have multiple target columns, it first
writes the number of columns and then writes all the target column names. To
make column names unique, the target map work ID is prepended to each column
name.
# When {{SparkDynamicPartitionPruner}} reads the file, it reads in all the
column names and finds each {{SourceInfo}} whose name appears among them.
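To illustrate points 1-3, here is a minimal sketch of the equivalence check and merge. {{DppSinkInfo}}, its fields, and the string-valued table descriptor are illustrative stand-ins, not the patch's code; the real {{SparkPartitionPruningSinkDesc}} carries more state, and the real merge also repoints the absorbed map work's {{eventSource}} maps and tmp path at the surviving sink:

```java
import java.util.*;

public class DppMergeSketch {
    // Simplified stand-in for SparkPartitionPruningSinkDesc:
    // equivalence is decided by the table descriptor alone.
    static class DppSinkInfo {
        final String tableDesc;  // stands in for TableDesc
        // target map work -> target column
        final Map<String, String> workToColumn = new LinkedHashMap<>();

        DppSinkInfo(String tableDesc, String work, String column) {
            this.tableDesc = tableDesc;
            workToColumn.put(work, column);
        }

        boolean equivalentTo(DppSinkInfo other) {
            return tableDesc.equals(other.tableDesc);
        }

        // Keep this sink; absorb the other's target columns/works.
        void merge(DppSinkInfo other) {
            workToColumn.putAll(other.workToColumn);
        }
    }

    public static void main(String[] args) {
        DppSinkInfo dpp1 = new DppSinkInfo("t1", "Map1", "part_col");
        DppSinkInfo dpp2 = new DppSinkInfo("t1", "Map2", "part_col");
        if (dpp1.equivalentTo(dpp2)) {
            dpp1.merge(dpp2);  // dpp1 now targets both Map1 and Map2
        }
        System.out.println(dpp1.workToColumn.keySet());
    }
}
```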
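The file format described in points 5 and 6 can be sketched as follows. The class, method names, and the {{workId + ":" + column}} encoding are assumptions for illustration only (the patch describes prepending the map work ID but not the exact byte layout); the sink writes a count followed by the prefixed names, and the pruner reads them all back:

```java
import java.io.*;
import java.util.*;

public class DppColumnFormatSketch {
    // Write the number of target columns, then each column name with
    // its target map work ID prepended so names stay unique.
    static byte[] writeColumns(Map<Integer, String> workIdToColumn) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(workIdToColumn.size());
        for (Map.Entry<Integer, String> e : workIdToColumn.entrySet()) {
            out.writeUTF(e.getKey() + ":" + e.getValue());
        }
        out.flush();
        return bos.toByteArray();
    }

    // Read all column names back; the pruner would then match these
    // against each SourceInfo's name.
    static List<String> readColumns(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int n = in.readInt();
        List<String> names = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            names.add(in.readUTF());
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        Map<Integer, String> cols = new LinkedHashMap<>();
        cols.put(1, "part_col");
        cols.put(2, "part_col");
        System.out.println(readColumns(writeColumns(cols)));
    }
}
```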
> HoS: combine equivalent DPP sink works
> --------------------------------------
>
> Key: HIVE-17877
> URL: https://issues.apache.org/jira/browse/HIVE-17877
> Project: Hive
> Issue Type: Improvement
> Reporter: Rui Li
> Assignee: Rui Li
> Attachments: HIVE-17877.1.patch
>
>