[GitHub] spark pull request: [SPARK-2176][SQL] Extra unnecessary exchange o...

yhuai Wed, 18 Jun 2014 10:13:36 -0700

GitHub user yhuai opened a pull request:

    https://github.com/apache/spark/pull/1116


    [SPARK-2176][SQL] Extra unnecessary exchange operator in the result of an explain command

    ```
    hql("explain select * from src group by key").collect().foreach(println)
    
    [ExplainCommand [plan#27:0]]
    [ Aggregate false, [key#25], [key#25,value#26]]
    [  Exchange (HashPartitioning [key#25:0], 200)]
    [   Exchange (HashPartitioning [key#25:0], 200)]
    [    Aggregate true, [key#25], [key#25]]
    [     HiveTableScan [key#25,value#26], (MetastoreRelation default, src, None), None]
    ```
    
    The explained plan contains two consecutive `Exchange` operators with identical partitioning.
    
    However, if we run the same query without `explain`:
    ```
    hql("select * from src group by key")
    
    res4: org.apache.spark.sql.SchemaRDD = 
    SchemaRDD[8] at RDD at SchemaRDD.scala:100
    == Query Plan ==
    Aggregate false, [key#8], [key#8,value#9]
     Exchange (HashPartitioning [key#8:0], 200)
      Aggregate true, [key#8], [key#8]
       HiveTableScan [key#8,value#9], (MetastoreRelation default, src, None), None
    ```
    The plan is fine.
    
    The cause of this bug is explained below.
    
    When we create an `execution.ExplainCommand`, we use the `executedPlan` as
    its child. But this `executedPlan` is prepared for execution a second time
    when we generate the `executedPlan` for the `ExplainCommand` itself, so
    `prepareForExecution` is effectively called twice on the same physical
    plan. After the first `prepareForExecution`, the references have already
    been bound (replaced with `BoundReference`s), so `AddExchange` cannot tell
    that the existing exchange already provides the required partitioning: an
    `ExchangeOperator` is created from `AttributeReference`s, and those
    references are changed to `BoundReference`s once `prepareForExecution` is
    called. As a result, an extra `ExchangeOperator` is inserted.
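
    The comparison failure can be illustrated with a small toy model. This is
    only a sketch: `AttrRef`, `BoundRef`, `HashPart`, `bind` and `satisfies`
    are hypothetical stand-ins invented for this example, not Spark's actual
    classes or methods, and the real partitioning check is assumed to be
    more involved than plain equality.

    ```scala
    // Toy model only: hypothetical stand-ins, not Spark's real classes.
    sealed trait Expr
    case class AttrRef(name: String, id: Int) extends Expr // plays the role of AttributeReference
    case class BoundRef(ordinal: Int) extends Expr         // plays the role of BoundReference

    // A hash partitioning described by its key expressions.
    case class HashPart(exprs: Seq[Expr], numPartitions: Int)

    object BindingDemo {
      // Replace each attribute with its position in the input schema.
      def bind(p: HashPart, input: Seq[AttrRef]): HashPart =
        p.copy(exprs = p.exprs.map {
          case a: AttrRef => BoundRef(input.indexOf(a))
          case other      => other
        })

      // Assumption: the requirement is met only by a structurally equal partitioning.
      def satisfies(provided: HashPart, required: HashPart): Boolean =
        provided == required

      def main(args: Array[String]): Unit = {
        val key      = AttrRef("key", 25)
        val required = HashPart(Seq(key), 200)
        val provided = HashPart(Seq(key), 200)

        // Before binding, the two partitionings compare equal.
        println(satisfies(provided, required))
        // After binding, the same partitioning no longer matches.
        println(satisfies(bind(provided, Seq(key)), required))
      }
    }
    ```

    In this model the first check succeeds and the second fails, although
    both describe hash partitioning on the same column. That mismatch is what
    would lead a rule like `AddExchange` to insert a second exchange.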
    
    
    I think that in `CommandStrategy` we should initialize the
    `ExplainCommand` with the `sparkPlan` (the input of
    `prepareForExecution`) instead of the `executedPlan`.
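
    The effect of the proposed change can be sketched with a toy plan tree.
    Everything here is a hypothetical simplification written for
    illustration: `Plan`, `Scan`, `Aggregate`, `Exchange`, `Explain` and the
    functions in `DoublePrepareDemo` only mimic the shape of the problem.
    Keys are plain strings, and binding is modelled by rewriting exchange
    keys into positional names.

    ```scala
    // Toy model only: hypothetical stand-ins, not Spark's real operators or rules.
    sealed trait Plan
    case class Scan(table: String) extends Plan
    case class Aggregate(partial: Boolean, keys: Seq[String], child: Plan) extends Plan
    case class Exchange(keys: Seq[String], child: Plan) extends Plan
    case class Explain(child: Plan) extends Plan

    object DoublePrepareDemo {
      // Put an Exchange below each final Aggregate unless its child is already
      // an Exchange on exactly the same keys.
      def addExchange(plan: Plan): Plan = plan match {
        case Aggregate(partial, keys, child) =>
          val prepared = addExchange(child)
          val alreadyPartitioned = prepared match {
            case Exchange(k, _) => k == keys
            case _              => false
          }
          if (partial || alreadyPartitioned) Aggregate(partial, keys, prepared)
          else Aggregate(partial, keys, Exchange(keys, prepared))
        case Exchange(keys, child) => Exchange(keys, addExchange(child))
        case Explain(child)        => Explain(addExchange(child))
        case s: Scan               => s
      }

      // Binding rewrites Exchange keys into positional form, e.g. "key" -> "input[0]".
      def bind(plan: Plan): Plan = plan match {
        case Exchange(keys, child)      => Exchange(keys.indices.map(i => s"input[$i]"), bind(child))
        case Aggregate(p, keys, child)  => Aggregate(p, keys, bind(child))
        case Explain(child)             => Explain(bind(child))
        case s: Scan                    => s
      }

      // Stand-in for prepareForExecution: add exchanges, then bind references.
      def prepare(plan: Plan): Plan = bind(addExchange(plan))

      def countExchanges(plan: Plan): Int = plan match {
        case Exchange(_, child)     => 1 + countExchanges(child)
        case Aggregate(_, _, child) => countExchanges(child)
        case Explain(child)         => countExchanges(child)
        case _: Scan                => 0
      }

      def main(args: Array[String]): Unit = {
        val sparkPlan    = Aggregate(false, Seq("key"), Aggregate(true, Seq("key"), Scan("src")))
        val executedPlan = prepare(sparkPlan)

        // Wrapping the already prepared plan: it gets prepared a second time.
        println(countExchanges(prepare(Explain(executedPlan))))
        // Wrapping the unprepared plan: it is prepared exactly once.
        println(countExchanges(prepare(Explain(sparkPlan))))
      }
    }
    ```

    In this model, wrapping `executedPlan` yields two exchanges because the
    second preparation no longer recognizes the bound keys, while wrapping
    `sparkPlan` yields one, matching the plan shown without `explain`.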

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yhuai/spark SPARK-2176

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1116.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1116
    
----
commit 197c19c1bbbdeb83b14f4de1ccdaf94dd56e95a9
Author: Yin Huai <[email protected]>
Date:   2014-06-18T17:07:52Z

    Use sparkPlan to initialize a Physical Explain Command instead of using executedPlan.

----


