GitHub user yhuai opened a pull request:

    https://github.com/apache/spark/pull/7685

    [SPARK-2205] [SPARK-7871] [SPARK-9372] [SQL] [WIP] Three SQL optimizations

    This PR has three SQL optimizations.
    
    First, it adds an optimization rule, `FilterNullsInJoinKey`, that inserts a
    `Filter` before join operators to drop rows whose join keys are null, since
    such rows will not be joined in an equi-join.
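
As a rough illustration of why such a filter is safe, the sketch below models an inner equi-join over plain Python dicts and checks that dropping null-keyed rows beforehand leaves the result unchanged. It is not Spark code: `filter_null_keys` and `inner_equi_join` are hypothetical helpers, and only the inner-join case is modeled (outer joins would need to keep null-keyed rows on the preserved side).

```python
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[str, Any]


def filter_null_keys(rows: List[Row], keys: Sequence[str]) -> List[Row]:
    """Hypothetical stand-in for the inserted Filter: keep rows whose keys are all non-null."""
    return [r for r in rows if all(r[k] is not None for k in keys)]


def inner_equi_join(
    left: List[Row],
    right: List[Row],
    left_keys: Sequence[str],
    right_keys: Sequence[str],
) -> List[Tuple[Row, Row]]:
    """Hash join with SQL `=` semantics: a null key never matches anything."""
    index: Dict[Tuple[Any, ...], List[Row]] = {}
    for r in right:
        key = tuple(r[k] for k in right_keys)
        if any(v is None for v in key):
            continue  # null = anything is never true
        index.setdefault(key, []).append(r)

    joined: List[Tuple[Row, Row]] = []
    for l in left:
        key = tuple(l[k] for k in left_keys)
        if any(v is None for v in key):
            continue
        for r in index.get(key, []):
            joined.append((l, r))
    return joined


left_rows = [{"id": 1, "a": "x"}, {"id": None, "a": "y"}, {"id": 2, "a": "z"}]
right_rows = [{"id": 1, "b": "p"}, {"id": None, "b": "q"}, {"id": 3, "b": "r"}]

unfiltered = inner_equi_join(left_rows, right_rows, ["id"], ["id"])
filtered = inner_equi_join(
    filter_null_keys(left_rows, ["id"]),
    filter_null_keys(right_rows, ["id"]),
    ["id"],
    ["id"],
)
```

Both calls produce the same pairs, but the filtered inputs are smaller, which is where the saving would come from when the join requires a shuffle.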
    
    Second, it adds `NullUnsafeClusteredDistribution` and
    `NullUnsafeHashPartitioning`, which can be used to distribute rows with null
    join keys evenly. `NullUnsafeClusteredDistribution` is essentially the same as
    `ClusteredDistribution` (now renamed to `NullSafeClusteredDistribution`),
    except that it does not require rows with null join keys to be clustered.
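
The difference between the two partitioning styles can be sketched as follows. This is a toy model, not the PR's implementation: `null_safe_partition` and `NullUnsafePartitioner` are hypothetical names, and round-robin placement of null-keyed rows is an assumption chosen only to make the even spread easy to see.

```python
import itertools
import zlib
from typing import Any, Tuple

Key = Tuple[Any, ...]


def null_safe_partition(key: Key, num_partitions: int) -> int:
    """Hash every key, nulls included, so all null-keyed rows land in one partition."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions


class NullUnsafePartitioner:
    """Hash non-null keys; spread null-keyed rows across partitions.

    Round-robin placement is an assumption of this sketch.
    """

    def __init__(self, num_partitions: int) -> None:
        self.num_partitions = num_partitions
        self._next_null_slot = itertools.count()

    def partition(self, key: Key) -> int:
        if any(v is None for v in key):
            # These rows can never match in an equi-join, so they need not be co-located.
            return next(self._next_null_slot) % self.num_partitions
        return null_safe_partition(key, self.num_partitions)


partitioner = NullUnsafePartitioner(num_partitions=4)
null_placements = [partitioner.partition((None,)) for _ in range(8)]
safe_placements = [null_safe_partition((None,), 4) for _ in range(8)]
```

With the null-safe scheme all eight null-keyed rows go to a single partition, whereas the null-unsafe scheme places two in each of the four partitions while still hashing equal non-null keys to the same partition.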
    
    Third, it adds `PartitioningCollection`, which represents the
    `outputPartitioning` of a `SparkPlan` with multiple children (e.g.
    `ShuffledHashJoin`), so that a `SparkPlan` can carry multiple descriptions of
    its partitioning scheme. `ShuffledHashJoin`, for example, has two such
    descriptions: `left.outputPartitioning` and `right.outputPartitioning`.
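
One way to picture a collection of partitionings is sketched below. The classes are simplified Python stand-ins that only borrow the names used above; they are not Spark's API. The rule that a collection satisfies a requirement when any member does, and the subset check in `HashPartitioning.satisfies`, are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ClusteredDistribution:
    """Requirement: rows that agree on `keys` must be in the same partition."""
    keys: Tuple[str, ...]


@dataclass(frozen=True)
class HashPartitioning:
    """Rows are placed by hashing `keys`."""
    keys: Tuple[str, ...]
    num_partitions: int

    def satisfies(self, required: ClusteredDistribution) -> bool:
        # Hashing on a non-empty subset of the required keys already
        # co-locates rows that agree on all required keys.
        return bool(self.keys) and set(self.keys) <= set(required.keys)


@dataclass(frozen=True)
class PartitioningCollection:
    """Several descriptions of the same physical layout of one plan's output."""
    partitionings: Tuple[HashPartitioning, ...]

    def satisfies(self, required: ClusteredDistribution) -> bool:
        return any(p.satisfies(required) for p in self.partitionings)


# A join on left.a = right.b: the output is hash-partitioned by `a`
# (left child's description) and, equivalently, by `b` (right child's).
join_output = PartitioningCollection(
    (HashPartitioning(("a",), 8), HashPartitioning(("b",), 8))
)
```

Under these assumptions, a downstream operator that needs its input clustered on either `a` or `b` could reuse the join's output without another shuffle, while a requirement on some unrelated column `c` would not be satisfied.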
    
    I will add more comments, documentation, and tests later.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yhuai/spark advancedQueryOptimization

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7685.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #7685
    
----
commit 9f214b798b3e505a3f766c294bb78b811ff4278a
Author: Yin Huai <[email protected]>
Date:   2015-07-23T19:21:08Z

    Filter out rows that will not be joined in equal joins early.

commit dc94cd57ebe8b4b62532343a6d9fa9ef39f01ecc
Author: Yin Huai <[email protected]>
Date:   2015-07-24T01:48:47Z

    Do not add unnecessary filters.

commit 96406a826ce175a2f70625ccac4d573a5a05c029
Author: Yin Huai <[email protected]>
Date:   2015-07-24T01:49:34Z

    Introduce NullSafeHashPartitioning and NullUnsafePartitioning.

commit aa81761f1f679bf2dda0e84b30424fc31f843e2b
Author: Yin Huai <[email protected]>
Date:   2015-07-26T22:51:38Z

    Bug fix and refactoring.

commit c8012468eea3f31ff8f9c182c120fc6c7c6d1f26
Author: Yin Huai <[email protected]>
Date:   2015-07-27T03:28:49Z

    wip

commit e66d5a90f5b9f04bff59c2ef47f4d2e4e4299978
Author: Yin Huai <[email protected]>
Date:   2015-07-27T05:03:46Z

    Add PartitioningCollection.

----

