[
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235054#comment-16235054
]
liyunzhang commented on HIVE-17486:
-----------------------------------
Currently HoS does not support multiple edges between two vertices. Let's use
TPC-DS [query28.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query28.sql]
as an example to show this.
Before the shared scan optimization (HIVE-16602), the Tez explain is
[scanshare.before.svg|https://issues.apache.org/jira/secure/attachment/12895148/scanshare.before.svg].
After the shared scan optimization (HIVE-16602), the Tez explain is
[scanshare.after.svg|https://issues.apache.org/jira/secure/attachment/12895149/scanshare.after.svg].
We can see that after the optimization there is only 1 map (before there were 6
maps), and the single Map 1 connects to the other 6 reducers through 6 edges.
This works because Tez supports multiple edges between two vertices (TEZ-1190).
I am now working on enabling this feature on HoS, but HoS does not support
"multiple edges between two vertices". So even if I change the physical plan to
match what Hive on Tez does, it may not reduce the number of maps.
[~lirui], [~xuefuz], can you help take a look at this problem?
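To make the limitation concrete, here is a minimal, purely hypothetical Java
sketch (not the actual Hive code; class and method names are made up) of a plan
graph that keys edges by the (parent, child) pair. With such a structure, a
second edge between the same two works silently replaces the first, which is
essentially the "no multiple edges between two vertices" restriction described
above:

{code:java}
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.HashMap;
import java.util.Map;

// Hypothetical plan graph: at most one edge property per (parent, child) pair.
// Real HoS classes (SparkWork, SparkEdgeProperty) are not used here; this only
// illustrates why a second edge between the same two vertices gets lost.
public class SingleEdgePlan {
    private final Map<SimpleImmutableEntry<String, String>, String> edges = new HashMap<>();

    public void connect(String parent, String child, String edgeProperty) {
        // A second call for the same (parent, child) pair overwrites the
        // previous edge instead of keeping both.
        edges.put(new SimpleImmutableEntry<>(parent, child), edgeProperty);
    }

    public static void main(String[] args) {
        SingleEdgePlan plan = new SingleEdgePlan();
        plan.connect("Map 1", "Reducer 2", "GROUP edge for aggregation #1");
        plan.connect("Map 1", "Reducer 2", "GROUP edge for aggregation #2");
        // Only one edge survives, so a shared scan cannot fan out the way the
        // Tez plan in scanshare.after.svg does.
        System.out.println(plan.edges.size()); // prints 1
    }
}
{code}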
> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
> Issue Type: Bug
> Reporter: liyunzhang
> Assignee: liyunzhang
> Priority: Major
> Attachments: scanshare.after.svg, scanshare.before.svg
>
>
> HIVE-16602 implements shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be
> merged so the data is read only once. The optimization is carried out at the
> physical level. In Hive on Spark, the result of a spark work is cached if it
> is used by more than 1 child spark work. Once SharedWorkOptimizer is enabled
> in the HoS physical plan, identical table scans are merged into 1 table scan
> whose result is used by more than 1 child spark work, so the cache mechanism
> means we do not need to repeat the same computation.
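As an illustration of the caching behaviour described in the issue, here is a
small, self-contained sketch using the plain Spark Java API (not Hive
internals; the input path and data layout are made up), in which one persisted
dataset feeds two separate "child" computations so the source is only scanned
once:

{code:java}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class SharedScanCacheDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SharedScanCacheDemo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical input; stands in for the single shared table scan.
        JavaRDD<String> scan = sc.textFile("/tmp/store_sales.csv")
                                 .persist(StorageLevel.MEMORY_AND_DISK());

        // Two "child works" consuming the same parent. Because the parent is
        // persisted, the file is read once, not once per child.
        long rowCount = scan.count();
        long nonEmpty = scan.filter(line -> !line.isEmpty()).count();

        System.out.println("rows=" + rowCount + ", nonEmpty=" + nonEmpty);
        sc.stop();
    }
}
{code}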