[ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278167#comment-16278167 ]

liyunzhang commented on HIVE-17486:
-----------------------------------

Here I record the problems I have currently met.
1. I want to change M->R to M->M->R and split the operator tree when encountering a TS. I created [SparkRuleDispatcher|https://github.com/kellyzly/hive/blob/HIVE-17486.3/ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkRuleDispatcher.java] to apply rules to the operator tree. The reason why I don't use DefaultRuleDispatcher is that there is already a rule called [Handle Analyze Command|https://github.com/kellyzly/hive/blob/jdk9-trial/ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java#L432] which splits the operator tree once a TS is encountered. The original [SparkCompiler#opRules|https://github.com/kellyzly/hive/blob/jdk9-trial/ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java#L417] is a LinkedHashMap, which stores one value per key and cannot handle the case where one key has two values. So the current solution is to change SparkCompiler#opRules to a Multimap and create SparkRuleDispatcher. But I am afraid that once a TS is encountered, only 1 rule will be applied by {{SparkRuleDispatcher#dispatch}}.

SparkRuleDispatcher#dispatch
{code}

@Override
  public Object dispatch(Node nd, Stack<Node> ndStack, Object... nodeOutputs)
      throws SemanticException {

    // find the firing rule
    // find the rule from the stack specified
    Rule rule = null;
    int minCost = Integer.MAX_VALUE;
    for (Rule r : procRules.keySet()) {
      int cost = r.cost(ndStack);
      if ((cost >= 0) && (cost < minCost)) {
        minCost = cost;
        // Here I am afraid that only 1 rule will be applied even if there are two rules for TS
        rule = r;
      }
    }

    Collection<NodeProcessor> procSet;

    if (rule == null) {
      procSet = defaultProcSet;
    } else {
      procSet = procRules.get(rule);
    }

    // Do nothing in case proc is null
    Object ret = null;
    for (NodeProcessor proc : procSet) {
      if (proc != null) {
        // Call the process function
        ret = proc.process(nd, ndStack, procCtx, nodeOutputs);
      }
    }
    return ret;
  }
{code}

I can change the above code as follows, but I don't know which rule's result to return if there is more than 1 rule for TS.
{code}
  @Override
  public Object dispatch(Node nd, Stack<Node> ndStack, Object... nodeOutputs)
      throws SemanticException {

    // find the firing rule
    // find the rule from the stack specified
    // collect the matching rules instead of keeping only a single one
    ArrayList<Rule> ruleList = new ArrayList<Rule>();
    int minCost = Integer.MAX_VALUE;
    for (Rule r : procRules.keySet()) {
      int cost = r.cost(ndStack);
      if ((cost >= 0) && (cost < minCost)) {
        minCost = cost;
        ruleList.add(r);
      }
    }

    Collection<NodeProcessor> procSet = null;

    if (ruleList.isEmpty()) {
      procSet = defaultProcSet;
    } else {
      for (Rule r : ruleList) {
        // Question: I don't know which rule I should use if there is more than 1 rule in the ruleList
        procSet = procRules.get(r);
      }
    }

    // Do nothing in case proc is null
    Object ret = null;
    for (NodeProcessor proc : procSet) {
      if (proc != null) {
        // Call the process function
        ret = proc.process(nd, ndStack, procCtx, nodeOutputs);
      }
    }
    return ret;
  }

{code}
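One option I can think of (just a sketch, and I am not sure it is the right semantics) is to keep every rule whose cost ties with the minimum, concatenate their processors, and return the result of the last processor, which is what the existing loop over procSet already does when a single rule has several processors:
{code}
  @Override
  public Object dispatch(Node nd, Stack<Node> ndStack, Object... nodeOutputs)
      throws SemanticException {

    // keep every rule that ties with the minimum cost instead of a single one
    ArrayList<Rule> minCostRules = new ArrayList<Rule>();
    int minCost = Integer.MAX_VALUE;
    for (Rule r : procRules.keySet()) {
      int cost = r.cost(ndStack);
      if (cost < 0) {
        continue;                 // the rule does not match this stack
      }
      if (cost < minCost) {
        minCost = cost;
        minCostRules.clear();     // a strictly cheaper rule replaces the others
        minCostRules.add(r);
      } else if (cost == minCost) {
        minCostRules.add(r);      // a tie: both TS rules stay in the list
      }
    }

    // concatenate the processors of all selected rules, in registration order
    Collection<NodeProcessor> procSet;
    if (minCostRules.isEmpty()) {
      procSet = defaultProcSet;
    } else {
      ArrayList<NodeProcessor> procs = new ArrayList<NodeProcessor>();
      for (Rule r : minCostRules) {
        procs.addAll(procRules.get(r));
      }
      procSet = procs;
    }

    // Do nothing in case proc is null; as in the current code, the result of
    // the last processor is the one that gets returned
    Object ret = null;
    for (NodeProcessor proc : procSet) {
      if (proc != null) {
        ret = proc.process(nd, ndStack, procCtx, nodeOutputs);
      }
    }
    return ret;
  }
{code}
Alternatively, if the two TS rules are registered under a single Rule key as in the Multimap sketch above, the current dispatch would not need this change at all, since procRules.get(rule) already returns both processors.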
[~lirui], [~xuefuz] can you give your suggestions about the problem?


> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
>                 Key: HIVE-17486
>                 URL: https://issues.apache.org/jira/browse/HIVE-17486
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang
>            Assignee: liyunzhang
>         Attachments: HIVE-17486.1.patch, explain.28.share.false, 
> explain.28.share.true, scanshare.after.svg, scanshare.before.svg
>
>
> In HIVE-16602, shared scans were implemented for Tez.
> Given a query plan, the goal is to identify scans on input tables that can be
> merged so the data is read only once. The optimization is carried out at the
> physical level. In Hive on Spark, the result of a spark work is cached if the
> spark work is used by more than 1 child spark work. After SharedWorkOptimizer
> is enabled in the physical plan in HoS, identical table scans are merged into 1
> table scan. The result of that table scan is then used by more than 1 child
> spark work, so the same computation does not need to be repeated thanks to the
> cache mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
