Fangshi Li created SPARK-24256:
----------------------------------

             Summary: ExpressionEncoder should support user-defined types as 
fields of Scala case class and tuple
                 Key: SPARK-24256
                 URL: https://issues.apache.org/jira/browse/SPARK-24256
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Fangshi Li


Right now, ExpressionEncoder supports ser/de of primitive types, as well as 
scala case class, tuple and java bean class. Spark's Dataset natively supports 
these mentioned types, but we find it is not flexible for other user-defined 
types and encoders.

For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. 
Although we can use AvroEncoder to define Dataset with types being the Avro 
Generic or Specific Record, using such Avro typed Dataset has many limitations: 
1. We can not use joinWith on this Dataset since the result is a tuple, but 
Avro types cannot be the field of this tuple.
2. We can not use some type-safe aggregation methods on this Dataset, such as 
KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
3. We cannot augment an Avro SpecificRecord with additional primitive fields 
together in a case class, which we find is a very common use case.

The root cause for these limitations is that Spark does not support simply 
define a Scala case class/tuple with subfields being any other user-defined 
type, since ExpressionEncoder does not discover the implicit Encoder for the 
user-defined fields, thus can not use any Encoder to serde the user-defined 
fields in case class/tuple.

To address this issue, we propose a trait as a contract(between 
ExpressionEncoder and any other user-defined Encoder) to enable case 
class/tuple/java bean's ExpressionEncoder to discover the 
serializer/deserializer/schema from the Encoder of the user-defined type.

With this proposed patch and our minor modification in AvroEncoder, we make it 
work with conf 
spark.expressionencoder.org.apache.avro.specific.SpecificRecord = 
com.databricks.spark.avro.AvroEncoder$

This is a patch we have implemented internally and has been used for a few 
quarters. We want to propose to upstream as we think this is a useful feature 
to make Dataset more flexible to user types.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to