[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235094#comment-16235094
 ] 

liyunzhang commented on HIVE-17486:
-----------------------------------

[~xuefuz]:

{quote}
 My gut feeling is that this needs to be combined with Spark RDD caching or 
Hive's materialized view.
{quote}
 About the optimization: I found that Hive on Tez indeed gets an improvement 
(20%+) on TPC-DS query28, 88 and 90 on modest hardware, or when scanning very 
large tables. So I want to implement it in Hive on Spark.
 I agree that we need to combine the optimization with Spark RDD caching to 
reduce the table scans. As you described, the multi-insert case benefits from 
Spark RDD caching because map12=map13, but more complex cases do not. Take 
TPC-DS query28.sql as an example.
 The physical plan is:
 {code}
TS[0]-FIL[52]-SEL[2]-GBY[3]-RS[4]-GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
TS[7]-FIL[53]-SEL[9]-GBY[10]-RS[11]-GBY[12]-RS[43]-JOIN[48]
TS[14]-FIL[54]-SEL[16]-GBY[17]-RS[18]-GBY[19]-RS[44]-JOIN[48]
TS[21]-FIL[55]-SEL[23]-GBY[24]-RS[25]-GBY[26]-RS[45]-JOIN[48]
TS[28]-FIL[56]-SEL[30]-GBY[31]-RS[32]-GBY[33]-RS[46]-JOIN[48]
TS[35]-FIL[57]-SEL[37]-GBY[38]-RS[39]-GBY[40]-RS[47]-JOIN[48]
{code}
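To make the redundancy concrete, here is a minimal sketch (plain Python with hypothetical names, not actual HoS code): in the plan above, each of the six operator trees begins with its own TS over the same table, so the table is read six times.

```python
# Hypothetical sketch: each branch performs its own table scan,
# as in the unshared plan where every operator tree has a separate TS.
SCAN_COUNT = 0

def table_scan(rows):
    """Stand-in for a TS operator: each call re-reads the source table."""
    global SCAN_COUNT
    SCAN_COUNT += 1
    return list(rows)

source = [{"qty": q} for q in range(10)]

# Six branches with different predicates, like FIL[52]..FIL[57] in query28.
branches = [lambda r, lo=lo: lo <= r["qty"] < lo + 2 for lo in range(0, 12, 2)]

# Without scan sharing, every branch triggers its own scan.
counts = [sum(1 for r in table_scan(source) if f(r)) for f in branches]

print(SCAN_COUNT)  # 6: the same table is scanned once per branch
```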

After the scan-share optimization, the physical plan is:
{code}
TS[0]-FIL[52]-SEL[2]-GBY[3]-RS[4]-GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
     -FIL[53]-SEL[9]-GBY[10]-RS[11]-GBY[12]-RS[43]-JOIN[48]
     -FIL[54]-SEL[16]-GBY[17]-RS[18]-GBY[19]-RS[44]-JOIN[48]
     -FIL[55]-SEL[23]-GBY[24]-RS[25]-GBY[26]-RS[45]-JOIN[48]
     -FIL[56]-SEL[30]-GBY[31]-RS[32]-GBY[33]-RS[46]-JOIN[48]
     -FIL[57]-SEL[37]-GBY[38]-RS[39]-GBY[40]-RS[47]-JOIN[48]

{code}

HoS splits the operator tree when it encounters an {{RS}}, producing:
{code}
Map1: TS[0]-FIL[52]-SEL[2]-GBY[3]-RS[4]
Map2: TS[0]-FIL[53]-SEL[9]-GBY[10]-RS[11]
Map3: TS[0]-FIL[54]-SEL[16]-GBY[17]-RS[18]
Map4: TS[0]-FIL[55]-SEL[23]-GBY[24]-RS[25]
Map5: TS[0]-FIL[56]-SEL[30]-GBY[31]-RS[32]
Map6: TS[0]-FIL[57]-SEL[37]-GBY[38]-RS[39]
{code}

We cannot combine Map1,...,Map6 because the {{FIL}} operators (FIL\[52\], 
FIL\[53\],...,FIL\[57\]) are not the same.
So my idea is: can we directly extract the TS from each MapTask and put it into 
a single Map?
{code}
Map0: TS[0]
Map1: FIL[52]-SEL[2]-GBY[3]-RS[4]
Map2: FIL[53]-SEL[9]-GBY[10]-RS[11]
Map3: FIL[54]-SEL[16]-GBY[17]-RS[18]
Map4: FIL[55]-SEL[23]-GBY[24]-RS[25]
Map5: FIL[56]-SEL[30]-GBY[31]-RS[32]
Map6: FIL[57]-SEL[37]-GBY[38]-RS[39]
{code}
Map0 contains only TS\[0\], and we connect Map0 to Map1,...,Map6.  
I would appreciate any suggestions from you!
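A minimal sketch of the proposed split (plain Python with hypothetical names, not actual HoS code): Map0 performs the scan once, and its cached output is reused by the six downstream branches, mimicking Spark RDD caching.

```python
# Hypothetical sketch: Map0 scans once; Map1..Map6 read the cached result.
SCAN_COUNT = 0

def table_scan(rows):
    """Stand-in for TS[0] in Map0: reads the source table."""
    global SCAN_COUNT
    SCAN_COUNT += 1
    return list(rows)

def cached_scan(rows, _cache={}):
    """Stand-in for Spark RDD caching: first call scans, later calls reuse."""
    if "rows" not in _cache:
        _cache["rows"] = table_scan(rows)
    return _cache["rows"]

source = [{"qty": q} for q in range(10)]

# Six branches with different predicates, like FIL[52]..FIL[57].
branches = [lambda r, lo=lo: lo <= r["qty"] < lo + 2 for lo in range(0, 12, 2)]

# Each MapN consumes the cached scan instead of re-reading the table.
counts = [sum(1 for r in cached_scan(source) if f(r)) for f in branches]

print(SCAN_COUNT)  # 1: the table is scanned only once
```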


> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
>                 Key: HIVE-17486
>                 URL: https://issues.apache.org/jira/browse/HIVE-17486
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang
>            Assignee: liyunzhang
>            Priority: Major
>         Attachments: scanshare.after.svg, scanshare.before.svg
>
>
> HIVE-16602 implemented shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. The optimization is carried out at the 
> physical level. Hive on Spark caches the result of a spark work if it is used 
> by more than one child spark work. After SharedWorkOptimizer is enabled in the 
> HoS physical plan, identical table scans are merged into one table scan, whose 
> result is used by more than one child spark work. Thus, because of the cache 
> mechanism, the same computation need not be repeated.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
