[
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292286#comment-16292286
]
liyunzhang commented on HIVE-17486:
-----------------------------------
[~xuefuz]: you mentioned earlier that the reason for disabling caching for MapInput
is an [IOContext initialization
problem|https://issues.apache.org/jira/browse/HIVE-8920?focusedCommentId=14260846&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14260846].
Reading HIVE-9041, there is an example that shows the IOContext initialization
problem:
{code}
I just found another bug regarding IOContext, when caching is turned on.
Taking the sample query above as example, right now I have this result plan:
MW 1 (table0)   MW 2 (table1)   MW 3 (table0)   MW 4 (table1)
      \            /                  \            /
       \          /                    \          /
        \        /                      \        /
         \      /                        \      /
          RW 1                            RW 2
Suppose MapWorks are executed from left to right, also suppose we are just
running with a single thread.
Then, the following will happen:
1. executing MW 1: since this is the first time we access table0, initialize
IOContext and make input path point to table0;
2. executing MW 2: since this is the first time we access table1, initialize
IOContext and make input path point to table1;
3. executing MW 3: since this is the second time we access table0, do not
initialize IOContext, and use the copy saved in step 2), which is table1.
Step 3 will then fail.
How to make MW 3 know that it needs to get the saved IOContext from MW 1, but
not MW 2?
{code}
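The failure mode quoted above can be sketched in a few lines. This is a hypothetical illustration, not actual Hive code: {{IOContextSketch}}, {{currentInputPath}}, and {{access}} are made-up names standing in for the single static IOContext shared across MapWorks.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the pre-fix design: one static context per process.
class IOContextSketch {
    // The single shared "IOContext" state: the input path it currently points at.
    static String currentInputPath = null;
    // Records which MapWork first touched each table (first access = initialize).
    static Map<String, String> initialized = new HashMap<>();

    // Mirrors steps 1-3 above: initialize only on the first access to a table;
    // every later access blindly reuses whatever the static context last held.
    static String access(String mapWork, String table) {
        if (!initialized.containsKey(table)) {
            initialized.put(table, mapWork);
            currentInputPath = table; // init: point the shared context at this table
        }
        return currentInputPath;
    }
}
```

Running MW 1, MW 2, MW 3 in order through this sketch, MW 3 asks for table0 but gets back table1, because the static context was last initialized by MW 2.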
If the problem exists in the MapInput RDD cache because IOContext is a
static variable that is stored in the cache and then updated by different
MapWorks, why is caching only disabled for the [MapInput RDD
cache|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202]?
It seems it should be disabled in all MapTrans. Please explain more when you have time.
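One conceivable way for MW 3 to recover MW 1's state rather than MW 2's would be to key the saved contexts by input path instead of keeping a single static copy. This is only a sketch of that idea, not Hive's actual fix; {{PathKeyedContexts}} and its method names are invented for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical fix sketch: one saved context per input path, so a later
// MapWork on table0 always retrieves the entry created for table0.
class PathKeyedContexts {
    static final ConcurrentHashMap<String, String> byPath = new ConcurrentHashMap<>();

    static String get(String inputPath) {
        // computeIfAbsent initializes exactly once per path; subsequent
        // MapWorks on the same path see the matching saved state, never
        // the state belonging to another table.
        return byPath.computeIfAbsent(inputPath, p -> "ctx-for-" + p);
    }
}
```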
> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
> Issue Type: Bug
> Reporter: liyunzhang
> Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, HIVE-17486.2.patch,
> explain.28.share.false, explain.28.share.true, scanshare.after.svg,
> scanshare.before.svg
>
>
> HIVE-16602 implements shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be
> merged so the data is read only once. The optimization is carried out at the
> physical level. In Hive on Spark, the result of a spark work is cached if the
> spark work is used by more than one child spark work. After SharedWorkOptimizer
> is enabled in the HoS physical plan, identical table scans are merged into one
> table scan, and the result of that table scan is used by more than one child
> spark work. Thus the cache mechanism saves us from repeating the same computation.
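The caching rule described in the quoted text (cache a work's result only when more than one child consumes it) can be sketched generically. This is an illustrative model, not SparkPlanGenerator itself; {{WorkResultCache}}, {{result}}, and the counter are invented names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Illustrative sketch: compute a work's result once and reuse it for every
// child when the work is shared (childCount > 1), mirroring how HoS caches
// a spark work's output after SharedWorkOptimizer merges identical scans.
class WorkResultCache {
    static final Map<String, Object> cache = new HashMap<>();
    static int computations = 0; // counts how often the scan actually runs

    static Object result(String work, int childCount, Supplier<Object> compute) {
        if (childCount > 1 && cache.containsKey(work)) {
            return cache.get(work); // second child reuses the cached result
        }
        computations++;
        Object r = compute.get();
        if (childCount > 1) {
            cache.put(work, r); // persist only works consumed by >1 child
        }
        return r;
    }
}
```

With two child works both asking for the same merged scan, the computation runs once and the second request is served from the cache.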
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)