[
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235094#comment-16235094
]
liyunzhang commented on HIVE-17486:
-----------------------------------
[~xuefuz]:
{quote}
My gut feeling is that this needs to be combined with Spark RDD caching or
Hive's materialized view.
{quote}
Regarding the optimization: I found that Hive on Tez indeed gets a noticeable
improvement (20%+) on TPC-DS query28, query88 and query90, on modest hardware or
when the table scan involves huge data. So I want to implement it in Hive on Spark.
I agree that we need to combine Spark RDD caching with the optimization to
reduce the number of table scans. As you described, the multi-insert case benefits
from Spark RDD caching because map12=map13, but more complex cases cannot. Take
TPC-DS query28.sql as an example.
The physical plan:
{code}
TS[0]-FIL[52]-SEL[2]-GBY[3]-RS[4]-GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
TS[7]-FIL[53]-SEL[9]-GBY[10]-RS[11]-GBY[12]-RS[43]-JOIN[48]
TS[14]-FIL[54]-SEL[16]-GBY[17]-RS[18]-GBY[19]-RS[44]-JOIN[48]
TS[21]-FIL[55]-SEL[23]-GBY[24]-RS[25]-GBY[26]-RS[45]-JOIN[48]
TS[28]-FIL[56]-SEL[30]-GBY[31]-RS[32]-GBY[33]-RS[46]-JOIN[48]
TS[35]-FIL[57]-SEL[37]-GBY[38]-RS[39]-GBY[40]-RS[47]-JOIN[48]
{code}
After the scan-share optimization, the physical plan becomes:
{code}
TS[0]-FIL[52]-SEL[2]-GBY[3]-RS[4]-GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
-FIL[53]-SEL[9]-GBY[10]-RS[11]-GBY[12]-RS[43]-JOIN[48]
-FIL[54]-SEL[16]-GBY[17]-RS[18]-GBY[19]-RS[44]-JOIN[48]
-FIL[55]-SEL[23]-GBY[24]-RS[25]-GBY[26]-RS[45]-JOIN[48]
-FIL[56]-SEL[30]-GBY[31]-RS[32]-GBY[33]-RS[46]-JOIN[48]
-FIL[57]-SEL[37]-GBY[38]-RS[39]-GBY[40]-RS[47]-JOIN[48]
{code}
HoS will split the operator tree when encountering an {{RS}}:
{code}
Map1: TS[0]-FIL[52]-SEL[2]-GBY[3]-RS[4]
Map2: TS[0]-FIL[53]-SEL[9]-GBY[10]-RS[11]
Map3: TS[0]-FIL[54]-SEL[16]-GBY[17]-RS[18]
Map4: TS[0]-FIL[55]-SEL[23]-GBY[24]-RS[25]
Map5: TS[0]-FIL[56]-SEL[30]-GBY[31]-RS[32]
Map6: TS[0]-FIL[57]-SEL[37]-GBY[38]-RS[39]
{code}
We cannot combine Map1, ..., Map6 because the {{FIL}} operators (FIL\[52\],
FIL\[53\], ..., FIL\[57\]) are not the same.
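To make the constraint concrete, here is a toy sketch (plain Python, not Hive's actual plan classes; the operator labels are just strings copied from the plan above): two map works can be merged only when their whole operator chains are identical, and here only the leading table scan matches.

```python
# Toy model: each map work is a list of operator labels
# (a hypothetical representation, not Hive's real data structures).
map_works = {
    "Map1": ["TS[0]", "FIL[52]", "SEL[2]", "GBY[3]", "RS[4]"],
    "Map2": ["TS[0]", "FIL[53]", "SEL[9]", "GBY[10]", "RS[11]"],
    "Map3": ["TS[0]", "FIL[54]", "SEL[16]", "GBY[17]", "RS[18]"],
}

def combinable(a, b):
    # Two map works can be combined only if their operator chains are identical.
    return a == b

# No pair is combinable, because the FIL (and everything after it) differs.
mergeable_pairs = [
    (x, y)
    for x in map_works for y in map_works
    if x < y and combinable(map_works[x], map_works[y])
]
print(mergeable_pairs)  # [] -- nothing can be merged

# But they all share the same leading table scan, which is exactly
# what extracting TS[0] into its own map would exploit.
shared_ts = all(ops[0] == "TS[0]" for ops in map_works.values())
print(shared_ts)  # True
```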
So what I am thinking is: can we directly extract the TS from each MapTask and
put it into a single Map?
{code}
Map0: TS[0]
Map1: FIL[52]-SEL[2]-GBY[3]-RS[4]
Map2: FIL[53]-SEL[9]-GBY[10]-RS[11]
Map3: FIL[54]-SEL[16]-GBY[17]-RS[18]
Map4: FIL[55]-SEL[23]-GBY[24]-RS[25]
Map5: FIL[56]-SEL[30]-GBY[31]-RS[32]
Map6: FIL[57]-SEL[37]-GBY[38]-RS[39]
{code}
Map0 contains only TS\[0\], and we connect Map0 to Map1, ..., Map6.
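A minimal simulation of the intended effect (plain Python with hypothetical names; in real HoS the shared work's output would be held via Spark RDD caching, e.g. {{rdd.cache()}}, rather than a plain list): the table is scanned once in Map0 and each downstream map applies only its own filter.

```python
scan_count = 0

def table_scan(rows):
    # Stands in for TS[0]; counts how many times the table is actually read.
    global scan_count
    scan_count += 1
    return list(rows)

table = [1, 2, 3, 4, 5, 6]
# Six different filter predicates, standing in for FIL[52]..FIL[57].
filters = [lambda r, i=i: r % 6 != i for i in range(6)]

# Without the split: each of the six map works performs its own table scan.
_ = [[r for r in table_scan(table) if f(r)] for f in filters]
scans_before = scan_count
print(scans_before)  # 6

# With the split: Map0 scans once and caches the result (here: a list,
# standing in for a cached RDD); Map1..Map6 all reuse it.
scan_count = 0
cached = table_scan(table)                           # Map0
_ = [[r for r in cached if f(r)] for f in filters]   # Map1..Map6
scans_after = scan_count
print(scans_after)  # 1
```

The point of the sketch is only the scan count: the per-map work (filter, aggregate, shuffle) is unchanged, but the shared input is read once instead of six times.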
I would appreciate any suggestions from you!
> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
> Issue Type: Bug
> Reporter: liyunzhang
> Assignee: liyunzhang
> Priority: Major
> Attachments: scanshare.after.svg, scanshare.before.svg
>
>
> In HIVE-16602, shared scans were implemented for Tez.
> Given a query plan, the goal is to identify scans on input tables that can be
> merged so the data is read only once. The optimization is carried out at the
> physical level. Hive on Spark caches the result of a spark work if that spark
> work is used by more than one child spark work. After SharedWorkOptimizer is
> enabled in the HoS physical plan, identical table scans are merged into one
> table scan. The result of that table scan is then used by more than one child
> spark work, so the same computation need not be repeated, thanks to the cache
> mechanism.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)