[GitHub] spark pull request #21310: [SPARK-24256][SQL] SPARK-24256: ExpressionEncoder...

fangshil Fri, 11 May 2018 23:19:48 -0700

GitHub user fangshil opened a pull request:

    https://github.com/apache/spark/pull/21310


    [SPARK-24256][SQL] SPARK-24256: ExpressionEncoder should support 
user-defined types as fields of Scala case class and tuple

    ## What changes were proposed in this pull request?
    
    Currently, ExpressionEncoder supports ser/de of primitive types, Scala case 
classes, tuples, and Java bean classes. Spark's Dataset natively supports these 
types, but it is not flexible enough for other user-defined types and their 
encoders.
    
    For example, spark-avro has an AvroEncoder for ser/de of Avro types in a 
Dataset. Although we can use AvroEncoder to define a Dataset whose element type 
is an Avro GenericRecord or SpecificRecord, such an Avro-typed Dataset has 
several limitations: 
    1. We cannot use joinWith on this Dataset, since the result is a tuple and 
Avro types cannot be fields of a tuple.
    2. We cannot use some type-safe aggregation methods on this Dataset, such 
as KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
    3. We cannot combine an Avro SpecificRecord with additional primitive 
fields in a case class, which we find is a very common use case.
    
    Spark does not support a Scala case class or tuple whose fields are of some 
other user-defined type because ExpressionEncoder does not discover the implicit 
Encoder for those field types, and so has no Encoder with which to ser/de the 
user-defined fields of the case class or tuple.
    
    To address this issue, we propose a trait that acts as a contract between 
ExpressionEncoder and any user-defined Encoder. It lets the ExpressionEncoder of 
a case class, tuple, or Java bean discover the serializer, deserializer, and 
schema from the Encoder of the user-defined type.
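
    The contract idea can be modeled without Spark. The following is a minimal 
sketch, not the trait from this patch and not Spark's API: FieldEncoder, 
Temperature, and every other name in it are hypothetical, and the schema is a 
plain string standing in for a Spark DataType. It only shows how a tuple encoder 
can be assembled from whatever field encoders are discoverable as implicits.

```scala
// Hypothetical model; none of these names come from Spark or from the patch.
object EncoderContractSketch {
  // The contract: what a composite encoder needs from a field's encoder.
  trait FieldEncoder[T] {
    def schema: String                 // stand-in for a Spark DataType
    def serialize(value: T): Any       // object -> internal representation
    def deserialize(internal: Any): T  // internal representation -> object
  }

  // A user-defined type with its own encoder, standing in for an Avro record.
  final case class Temperature(celsius: Double)

  implicit val temperatureEncoder: FieldEncoder[Temperature] =
    new FieldEncoder[Temperature] {
      val schema = "double"
      def serialize(value: Temperature): Any = value.celsius
      def deserialize(internal: Any): Temperature =
        Temperature(internal.asInstanceOf[Double])
    }

  implicit val intEncoder: FieldEncoder[Int] = new FieldEncoder[Int] {
    val schema = "int"
    def serialize(value: Int): Any = value
    def deserialize(internal: Any): Int = internal.asInstanceOf[Int]
  }

  // The tuple encoder delegates each field to the encoder found for its type.
  implicit def tuple2Encoder[A, B](
      implicit a: FieldEncoder[A],
      b: FieldEncoder[B]): FieldEncoder[(A, B)] =
    new FieldEncoder[(A, B)] {
      val schema = s"struct<_1:${a.schema},_2:${b.schema}>"
      def serialize(value: (A, B)): Any =
        Vector[Any](a.serialize(value._1), b.serialize(value._2))
      def deserialize(internal: Any): (A, B) = {
        val row = internal.asInstanceOf[Vector[Any]]
        (a.deserialize(row(0)), b.deserialize(row(1)))
      }
    }

  def encoderFor[T](implicit e: FieldEncoder[T]): FieldEncoder[T] = e

  // The compiler finds both field encoders and composes them.
  val pairEncoder: FieldEncoder[(Int, Temperature)] =
    encoderFor[(Int, Temperature)]
}
```

    In this model, pairEncoder.serialize((7, Temperature(21.5))) yields a 
two-element row, and deserialize turns that row back into the original tuple. 
A tuple with a user-defined field needs only an encoder for that field that 
honors the contract.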
    
    With this patch and a minor modification to AvroEncoder, these limitations 
are removed once the following cluster-default conf is set: 
spark.expressionencoder.org.apache.avro.specific.SpecificRecord = 
com.databricks.spark.avro.AvroEncoder$
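
    One way to read the conf entry above is as a mapping from a type name (the 
key suffix) to the JVM class of the encoder object (a Scala object named 
AvroEncoder compiles to a class named AvroEncoder$). The sketch below rests on 
that reading, which is an assumption and not confirmed by the patch: the helper 
encoderClassNameFor, the supertype-matching rule, and the example class 
com.example.NumberEncoder$ are all hypothetical.

```scala
// Hypothetical lookup; the matching rule is an assumption, not the patch's code.
object EncoderConfSketch {
  // Prefix taken from the configuration key quoted above.
  val Prefix = "spark.expressionencoder."

  // Returns the configured encoder class name for `cls` when a configured
  // type is `cls` itself or one of its supertypes.
  def encoderClassNameFor(
      conf: Map[String, String],
      cls: Class[_]): Option[String] =
    conf.collectFirst {
      case (key, encoderName)
          if key.startsWith(Prefix) &&
            isSupertype(key.stripPrefix(Prefix), cls) =>
        encoderName
    }

  // A type name that cannot be loaded simply does not match.
  private def isSupertype(typeName: String, cls: Class[_]): Boolean =
    scala.util.Try(Class.forName(typeName)).toOption
      .exists(_.isAssignableFrom(cls))
}
```

    For instance, with java.lang.Number standing in for SpecificRecord, a conf of 
Map("spark.expressionencoder.java.lang.Number" -> "com.example.NumberEncoder$") 
resolves classOf[java.lang.Integer] to that encoder name and resolves 
classOf[String] to nothing.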
    
    We implemented this patch internally and have used it for a few quarters. 
We would like to upstream it because we think it is a useful feature that makes 
Dataset more flexible with respect to user types.
    
    
    ## How was this patch tested?
    We have tested this patch internally. We did not write unit tests because 
the user-defined Encoder (AvroEncoder) is defined outside Spark. We welcome 
comments on how to write unit tests for this patch.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/fangshil/spark SPARK-24256

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21310.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21310
    
----
commit 547ff81e0470bed14371996da89924bfed0cc101
Author: Fangshi Li <fli@...>
Date:   2018-02-02T02:16:14Z

    [SPARK-24256][SQL]ExpressionEncoder should support user-defined types as 
fields of Scala case class and tuple

----

