[
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237271#comment-16237271
]
liyunzhang commented on HIVE-17486:
-----------------------------------
[~lirui]:
{quote}
I also think that's possible in theory. But I guess it will require lots of
work. E.g. we may need to modify MapOperator to accommodate the new M->M->R
scheme
{quote}
Now I am working on changing from the {{M->R}} to the {{M->M->R}} scheme, but I am not very
clear about the modification needed on MapOperator. If you know, please explain in
more detail. I think we first need to change {{GenSparkWork}} to split the physical
operator tree once it encounters a TS with more than 1 child. For example, given the
physical plan
{code}
TS[0]-FIL[52]-SEL[2]-GBY[3]-RS[4]-GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
-FIL[53]-SEL[9]-GBY[10]-RS[11]-GBY[12]-RS[43]-JOIN[48]
{code}
As TS\[0\] has two children (FIL\[52\] and FIL\[53\]), we first split at TS\[0\] and
put it in Map1, then split the following operator trees whenever an RS is encountered.
So the final operator tree will be
{code}
Map1: TS[0]
Map2:FIL[52]-SEL[2]-GBY[3]-RS[4]
Map3:FIL[53]-SEL[9]-GBY[10]-RS[11]
Reducer1:GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
Reducer2:GBY[12]-RS[43]
{code}
This is very initial thinking. If you have suggestions, please tell me, thanks!
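To make the split rules concrete, here is a rough, self-contained sketch. It is not based on the real {{GenSparkWork}}/{{Operator}} classes; the {{Op}}, {{Vertex}} and {{PlanSplitter}} types below are placeholders, and it treats the plan as a tree, so it does not merge the two RS branches into the single shared join reducer shown above.
{code:java}
import java.util.ArrayList;
import java.util.List;

// Placeholder operator node, e.g. "TS[0]", "FIL[52]", "RS[4]".
class Op {
    final String name;
    final List<Op> children = new ArrayList<>();
    Op(String name) { this.name = name; }
    boolean isTableScan()  { return name.startsWith("TS"); }
    boolean isReduceSink() { return name.startsWith("RS"); }
}

// Placeholder for one map/reduce vertex (one piece of spark work).
class Vertex {
    final String label;
    final List<String> ops = new ArrayList<>();
    Vertex(String label) { this.label = label; }
}

class PlanSplitter {
    private final List<Vertex> vertices = new ArrayList<>();
    private int maps = 0, reducers = 0;

    List<Vertex> split(Op root) {
        walk(root, newMap());
        return vertices;
    }

    private void walk(Op op, Vertex current) {
        current.ops.add(op.name);
        // Rule 1: a TS with more than one child stays in its own map vertex,
        // and every child branch starts a new downstream map (M->M edge).
        boolean splitAfterTs = op.isTableScan() && op.children.size() > 1;
        for (Op child : op.children) {
            if (splitAfterTs) {
                walk(child, newMap());
            } else if (op.isReduceSink()) {
                // Rule 2: an RS closes the current vertex; the operators
                // after it go into a new reducer vertex.
                walk(child, newReducer());
            } else {
                walk(child, current);
            }
        }
    }

    private Vertex newMap() {
        Vertex v = new Vertex("Map" + (++maps));
        vertices.add(v);
        return v;
    }

    private Vertex newReducer() {
        Vertex v = new Vertex("Reducer" + (++reducers));
        vertices.add(v);
        return v;
    }
}
{code}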
> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
> Issue Type: Bug
> Reporter: liyunzhang
> Assignee: liyunzhang
> Priority: Major
> Attachments: scanshare.after.svg, scanshare.before.svg
>
>
> HIVE-16602 implemented shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be
> merged so the data is read only once. The optimization is carried out at the
> physical level. In Hive on Spark, the result of a spark work is cached if the
> spark work is used by more than 1 child spark work. After SharedWorkOptimizer
> is enabled in the physical plan in HoS, identical table scans are merged into
> 1 table scan. The result of this table scan is used by more than 1 child spark
> work, so the same computation does not need to be repeated thanks to the
> cache mechanism.
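On the caching side mentioned in the description above, the effect in plain Spark terms is just persisting an RDD that feeds more than one downstream action. A minimal, generic sketch using Spark's public Java API (not Hive's actual SparkPlan/SparkTask code; the input path and filters are made up):
{code:java}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class SharedScanCacheSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[2]", "shared-scan-sketch");

        // Stand-in for the single merged table scan produced by SharedWorkOptimizer.
        JavaRDD<String> scan = sc.textFile("/tmp/example_table"); // hypothetical path

        // The scan output feeds two child works, so persist it; otherwise each
        // downstream action would re-read and re-scan the input.
        scan.persist(StorageLevel.MEMORY_AND_DISK());

        long branch1 = scan.filter(row -> row.contains("a")).count(); // child work 1
        long branch2 = scan.filter(row -> row.contains("b")).count(); // child work 2

        System.out.println(branch1 + " / " + branch2);
        sc.stop();
    }
}
{code}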