[
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278167#comment-16278167
]
liyunzhang commented on HIVE-17486:
-----------------------------------
Recording the problems I have met so far:
1. I want to change M->R to M->M->R and split the operator tree when
encountering a TS. I created [SparkRuleDispatcher|
https://github.com/kellyzly/hive/blob/HIVE-17486.3/ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkRuleDispatcher.java]
to apply rules to the operator tree. The reason I don't use
DefaultRuleDispatcher is that there is already a rule, [Handle Analyze
Command|https://github.com/kellyzly/hive/blob/jdk9-trial/ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java#L432],
that splits operator trees on encountering a TS. The original
[SparkCompiler#opRules|https://github.com/kellyzly/hive/blob/jdk9-trial/ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java#L417]
is a LinkedHashMap, which maps each key to a single value, so it cannot handle
the case of one key with two values. The current solution is therefore to
change SparkCompiler#opRules to a Multimap and create SparkRuleDispatcher. But
I am afraid that once a TS is encountered, only one rule will be applied in
{{SparkRuleDispatcher#dispatch}}:
{code}
@Override
public Object dispatch(Node nd, Stack<Node> ndStack, Object... nodeOutputs)
    throws SemanticException {
  // find the firing rule
  // find the rule from the stack specified
  Rule rule = null;
  int minCost = Integer.MAX_VALUE;
  for (Rule r : procRules.keySet()) {
    int cost = r.cost(ndStack);
    if ((cost >= 0) && (cost < minCost)) {
      minCost = cost;
      // Here I am afraid only 1 rule will be applied even if there are
      // two rules for TS
      rule = r;
    }
  }
  Collection<NodeProcessor> procSet;
  if (rule == null) {
    procSet = defaultProcSet;
  } else {
    procSet = procRules.get(rule);
  }
  // Do nothing in case proc is null
  Object ret = null;
  for (NodeProcessor proc : procSet) {
    if (proc != null) {
      // Call the process function
      ret = proc.process(nd, ndStack, procCtx, nodeOutputs);
    }
  }
  return ret;
}
{code}
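To make the concern concrete, here is a minimal standalone sketch. The {{SimpleRule}} record and the string "processors" are hypothetical stand-ins for Hive's {{Rule}} and {{NodeProcessor}}, not the actual classes; the selection loop mirrors the strictly-lower-cost comparison above, so the second of two tying rules is never selected:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SingleRuleDemo {
    // Hypothetical stand-in for Hive's Rule: a name plus a fixed match cost.
    record SimpleRule(String name, int cost) {}

    // Same selection logic as dispatch() above: only a strictly lower cost
    // replaces the current winner, so a rule that ties on cost is ignored.
    static SimpleRule pickRule(Map<SimpleRule, String> procRules) {
        SimpleRule rule = null;
        int minCost = Integer.MAX_VALUE;
        for (SimpleRule r : procRules.keySet()) {
            if (r.cost() >= 0 && r.cost() < minCost) {
                minCost = r.cost();
                rule = r;
            }
        }
        return rule;
    }

    public static void main(String[] args) {
        // Two rules that both match a TS node with the same cost.
        Map<SimpleRule, String> procRules = new LinkedHashMap<>();
        procRules.put(new SimpleRule("SplitAtTS", 1), "splitProcessor");
        procRules.put(new SimpleRule("HandleAnalyze", 1), "analyzeProcessor");

        // Only the first of the two tying rules ever fires.
        System.out.println("fired: " + pickRule(procRules).name());
    }
}
```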
I can change the above code as follows, but I don't know which rule's result
to return if there is more than one rule for TS.
{code}
@Override
public Object dispatch(Node nd, Stack<Node> ndStack, Object... nodeOutputs)
    throws SemanticException {
  // find the firing rule
  // find the rule from the stack specified
  List<Rule> ruleList = new ArrayList<Rule>();
  int minCost = Integer.MAX_VALUE;
  for (Rule r : procRules.keySet()) {
    int cost = r.cost(ndStack);
    if ((cost >= 0) && (cost < minCost)) {
      minCost = cost;
      ruleList.add(r);
    }
  }
  Collection<NodeProcessor> procSet = defaultProcSet;
  if (!ruleList.isEmpty()) {
    for (Rule r : ruleList) {
      // Question: Here I don't know which rule I should use if there is
      // more than 1 rule in the ruleList
      procSet = procRules.get(r);
    }
  }
  // Do nothing in case proc is null
  Object ret = null;
  for (NodeProcessor proc : procSet) {
    if (proc != null) {
      // Call the process function
      ret = proc.process(nd, ndStack, procCtx, nodeOutputs);
    }
  }
  return ret;
}
{code}
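One possible answer to the open question, sketched below with a hypothetical {{SimpleRule}} record and strings standing in for Hive's {{Rule}} and {{NodeProcessor}} (this is only a suggestion, not Hive's actual behavior): collect every rule that ties on the minimum cost, fire the processors of all of them in insertion order, and return the result of the last one, which is consistent with how {{dispatch}} already overwrites {{ret}} across the processors of a single rule:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AllMatchingRulesDemo {
    // Hypothetical stand-in for Hive's Rule: a name plus a fixed match cost.
    record SimpleRule(String name, int cost) {}

    // Collect every rule that ties on the minimum cost, not just the first.
    static List<SimpleRule> matchingRules(Map<SimpleRule, String> procRules) {
        int minCost = Integer.MAX_VALUE;
        List<SimpleRule> matched = new ArrayList<>();
        for (SimpleRule r : procRules.keySet()) {
            int cost = r.cost();
            if (cost < 0 || cost > minCost) {
                continue;             // no match, or worse than the best so far
            }
            if (cost < minCost) {     // strictly better: restart the tie list
                minCost = cost;
                matched.clear();
            }
            matched.add(r);           // cost == minCost here: record the tie
        }
        return matched;
    }

    public static void main(String[] args) {
        Map<SimpleRule, String> procRules = new LinkedHashMap<>();
        procRules.put(new SimpleRule("SplitAtTS", 1), "splitProcessor");
        procRules.put(new SimpleRule("HandleAnalyze", 1), "analyzeProcessor");
        procRules.put(new SimpleRule("Expensive", 5), "otherProcessor");

        // Run the "processor" of every tying rule in insertion order and keep
        // the last result, mirroring how dispatch() overwrites ret in a loop.
        Object ret = null;
        for (SimpleRule r : matchingRules(procRules)) {
            ret = procRules.get(r);   // stand-in for proc.process(...)
        }
        System.out.println("result from: " + ret);  // last tying rule's processor
    }
}
```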
[~lirui], [~xuefuz] can you give your suggestions about the problem?
> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
> Issue Type: Bug
> Reporter: liyunzhang
> Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, explain.28.share.false,
> explain.28.share.true, scanshare.after.svg, scanshare.before.svg
>
>
> In HIVE-16602, shared scans were implemented with Tez. Given a query plan,
> the goal is to identify scans on input tables that can be merged so the data
> is read only once. The optimization is carried out at the physical level. In
> Hive on Spark, the result of a spark work is cached if it is used by more
> than one child spark work. After SharedWorkOptimizer is enabled in the
> physical plan in HoS, identical table scans are merged into one table scan,
> whose result is then used by more than one child spark work. Thus, thanks to
> the cache mechanism, the same computation need not be repeated.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)