[
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292286#comment-16292286
]
liyunzhang commented on HIVE-17486:
-----------------------------------
[~xuefuz]: you mentioned earlier that the reason for disabling caching for MapInput
is an [IOContext initialization
problem|https://issues.apache.org/jira/browse/HIVE-8920?focusedCommentId=14260846&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14260846].
Reading HIVE-9041, there is an example that shows the IOContext initialization
problem:
{code}
I just found another bug regarding IOContext, when caching is turned on.
Taking the sample query above as example, right now I have this result plan:
MW 1 (table0)   MW 2 (table1)   MW 3 (table0)   MW 4 (table1)
      \            /                  \            /
       \          /                    \          /
        \        /                      \        /
         \      /                        \      /
          RW 1                            RW 2
Suppose MapWorks are executed from left to right, also suppose we are just
running with a single thread.
Then, the following will happen:
1. executing MW 1: since this is the first time we access table0, initialize
IOContext and make input path point to table0;
2. executing MW 2: since this is the first time we access table1, initialize
IOContext and make input path point to table1;
3. executing MW 3: since this is the second time we access table0, do not
initialize IOContext, and use the copy saved in step 2), which is table1.
Step 3 will then fail.
How to make MW 3 know that it needs to get the saved IOContext from MW 1, but
not MW 2?
{code}
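The failure mode quoted above can be sketched in a few lines. This is a hypothetical illustration, not actual Hive code: {{IOContextSketch}}, {{currentInputPath}}, and {{access}} are made-up names standing in for the single static IOContext shared across MapWorks.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the pre-fix design: one static context per process.
class IOContextSketch {
    // The single shared "IOContext" state: the input path it currently points at.
    static String currentInputPath = null;
    // Records which MapWork first touched each table (first access = initialize).
    static Map<String, String> initialized = new HashMap<>();

    // Mirrors steps 1-3 above: initialize only on the first access to a table;
    // every later access blindly reuses whatever the static context last held.
    static String access(String mapWork, String table) {
        if (!initialized.containsKey(table)) {
            initialized.put(table, mapWork);
            currentInputPath = table; // init: point the shared context at this table
        }
        return currentInputPath;
    }
}
```

Running MW 1, MW 2, MW 3 in order through this sketch, MW 3 asks for table0 but gets back table1, because the static context was last initialized by MW 2.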
If the problem exists in the MapInput RDD cache because IOContext is a
static variable that is stored in the cache and then updated by different
MapWorks, why is caching only disabled for the [MapInput RDD
cache|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202]?
It seems it should be disabled in all MapTrans. Please explain more when you have time.
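One conceivable way for MW 3 to recover MW 1's state rather than MW 2's would be to key the saved contexts by input path instead of keeping a single static copy. This is only a sketch of that idea, not Hive's actual fix; {{PathKeyedContexts}} and its method names are invented for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical fix sketch: one saved context per input path, so a later
// MapWork on table0 always retrieves the entry created for table0.
class PathKeyedContexts {
    static final ConcurrentHashMap<String, String> byPath = new ConcurrentHashMap<>();

    static String get(String inputPath) {
        // computeIfAbsent initializes exactly once per path; subsequent
        // MapWorks on the same path see the matching saved state, never
        // the state belonging to another table.
        return byPath.computeIfAbsent(inputPath, p -> "ctx-for-" + p);
    }
}
```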
> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
> Issue Type: Bug
> Reporter: liyunzhang
> Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, HIVE-17486.2.patch,
> explain.28.share.false, explain.28.share.true, scanshare.after.svg,
> scanshare.before.svg
>
>
> HIVE-16602 implements shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be
> merged so the data is read only once. The optimization is carried out at the
> physical level. In Hive on Spark, the result of a spark work is cached if the
> spark work is used by more than one child spark work. After SharedWorkOptimizer
> is enabled in the HoS physical plan, identical table scans are merged into one
> table scan, and the result of that table scan is used by more than one child
> spark work. Thus the cache mechanism saves us from repeating the same computation.
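The caching rule described in the quoted text (cache a work's result only when more than one child consumes it) can be sketched generically. This is an illustrative model, not SparkPlanGenerator itself; {{WorkResultCache}}, {{result}}, and the counter are invented names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Illustrative sketch: compute a work's result once and reuse it for every
// child when the work is shared (childCount > 1), mirroring how HoS caches
// a spark work's output after SharedWorkOptimizer merges identical scans.
class WorkResultCache {
    static final Map<String, Object> cache = new HashMap<>();
    static int computations = 0; // counts how often the scan actually runs

    static Object result(String work, int childCount, Supplier<Object> compute) {
        if (childCount > 1 && cache.containsKey(work)) {
            return cache.get(work); // second child reuses the cached result
        }
        computations++;
        Object r = compute.get();
        if (childCount > 1) {
            cache.put(work, r); // persist only works consumed by >1 child
        }
        return r;
    }
}
```

With two child works both asking for the same merged scan, the computation runs once and the second request is served from the cache.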
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)