GitHub user marmbrus opened a pull request:

    https://github.com/apache/spark/pull/9300

    [SPARK-11347][SQL] Support for joinWith in Datasets

    This PR adds a new operation `joinWith` to a `Dataset`, which returns a 
`Tuple` for each pair where a given `condition` evaluates to true.
    
    ```scala
    case class ClassData(a: String, b: Int)
    
    val ds1 = Seq(ClassData("a", 1), ClassData("b", 2)).toDS()
    val ds2 = Seq(("a", 1), ("b", 2)).toDS()
    
    > ds1.joinWith(ds2, $"_1" === $"a").collect()
    res0: Array((ClassData("a", 1), ("a", 1)), (ClassData("b", 2), ("b", 2)))
    ```
    
    This operation is similar to the relational `join` function, with one important difference in the result schema. Since `joinWith` preserves the objects present on either side of the join, the result schema is nested into a tuple under the column names `_1` and `_2`.
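
    For illustration, here is a minimal sketch of the difference, using the `ds1`/`ds2` definitions above (schema output abbreviated):

    ```scala
    // A relational join flattens both sides into one row, so the four
    // columns sit side by side (and could collide by name).
    ds1.toDF().join(ds2.toDF(), $"_1" === $"a").printSchema()
    // root
    //  |-- a: string
    //  |-- b: integer
    //  |-- _1: string
    //  |-- _2: integer

    // joinWith keeps each side as an intact object, nested under _1 and _2.
    ds1.joinWith(ds2, $"_1" === $"a").toDF().printSchema()
    // root
    //  |-- _1: struct (a: string, b: integer)
    //  |-- _2: struct (_1: string, _2: integer)
    ```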
    
    This type of join can be useful both for preserving type safety with the original object types and for working with relational data where the two sides of the join have column names in common.
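
    Because both inputs keep their original types, the resulting pairs can be consumed in a fully typed way. A small sketch, again assuming the definitions above and the usual `sqlContext.implicits._` import:

    ```scala
    // The joined Dataset is typed as (ClassData, (String, Int)), so both
    // sides can be pattern matched without casting or column lookups.
    val joined: Dataset[(ClassData, (String, Int))] =
      ds1.joinWith(ds2, $"_1" === $"a")

    joined.map { case (left, right) => left.b + right._2 }
    ```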
    
    ## Required Changes to Encoders
    In the process of working on this patch, several deficiencies in the way we were handling encoders were discovered.  Specifically, it turned out to be very difficult to `rebind` the non-expression-based encoders to extract the nested objects from the results of joins (and also of typed selects that return tuples).
    
    As a result, the following changes were made:
     - `ClassEncoder` has been renamed to `ExpressionEncoder` and improved to also handle primitive types.
     - All internal operations on `Dataset`s now operate on an `ExpressionEncoder`.  If a user tries to pass in a non-`ExpressionEncoder`, an error will be thrown.  We can relax this requirement in the future by constructing a wrapper class that uses expressions to project the row to the expected schema, shielding the user's code from the required remapping.  This strikes a nice balance: we don't force user encoders to understand attribute references and binding, but we still allow our native encoders to leverage runtime code generation to construct specific encoders for a given schema, avoiding an extra remapping step.
     - Additionally, the semantics for different types of objects are now better defined.  As stated in the `ExpressionEncoder` scaladoc (and illustrated in the sketch after this list):
      - Classes will have their subfields extracted by name using `UnresolvedAttribute` and `UnresolvedExtractValue` expressions.
      - Tuples will have their subfields extracted by position using `BoundReference` expressions.
      - Primitives will have their values extracted from the first ordinal, with a schema that defaults to the name `value`.
     - Finally, the binding lifecycle for `Encoders` has been unified across the codebase.  Encoders are now `resolved` against the appropriate schema in the constructor of `Dataset`.  This process replaces unresolved expressions with concrete `AttributeReference` expressions.  Binding then happens on demand, when an encoder is going to be used to construct an object.  This closely mirrors the lifecycle of standard expressions when executing normal SQL or `DataFrame` queries.
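
    To make the extraction semantics above concrete, here is a small sketch (schemas abbreviated; `ClassData` as defined earlier, with the `Encoder` implicits assumed in scope):

    ```scala
    // Classes: subfields are extracted by name.
    Seq(ClassData("a", 1)).toDS().schema      // (a: string, b: integer)

    // Tuples: subfields are extracted by position.
    Seq(("a", 1)).toDS().schema               // (_1: string, _2: integer)

    // Primitives: a single value whose column defaults to the name `value`.
    Seq(1, 2, 3).toDS().schema                // (value: integer)
    ```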
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/marmbrus/spark datasets-tuples

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9300.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9300
    
----
commit e88dfab83fb2386d27d36ad37cff18e7c90293bf
Author: Michael Armbrust <[email protected]>
Date:   2015-10-27T14:14:00Z

    [SPARK-11347][SQL] Support for joinWith in Datasets

commit c515001d338acb37de0c2cfb37883e6a7c04c9af
Author: Michael Armbrust <[email protected]>
Date:   2015-10-27T14:37:29Z

    cleanup

----

