[ 
https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472933#comment-16472933
 ] 

Apache Spark commented on SPARK-24256:
--------------------------------------

User 'fangshil' has created a pull request for this issue:
https://github.com/apache/spark/pull/21310

> ExpressionEncoder should support user-defined types as fields of Scala case 
> class and tuple
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24256
>                 URL: https://issues.apache.org/jira/browse/SPARK-24256
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Fangshi Li
>            Priority: Major
>
> Right now, ExpressionEncoder supports ser/de of primitive types, as well as 
> scala case class, tuple and java bean class. Spark's Dataset natively 
> supports these mentioned types, but we find Dataset is not flexible for other 
> user-defined types and encoders.
> For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. 
> Although we can use AvroEncoder to define Dataset with types being the Avro 
> Generic or Specific Record, using such Avro typed Dataset has many 
> limitations: 
>  1. We can not use joinWith on this Dataset since the result is a tuple, but 
> Avro types cannot be the field of this tuple.
>  2. We can not use some type-safe aggregation methods on this Dataset, such 
> as KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
>  3. We cannot augment an Avro SpecificRecord with additional primitive fields 
> together in a case class, which we find is a very common use case.
> The limitation that Spark does not support define a Scala case class/tuple 
> with subfields being any other user-defined type, is because 
> ExpressionEncoder does not discover the implicit Encoder for the user-defined 
> field types, thus can not use any Encoder to serde the user-defined fields in 
> case class/tuple.
> To address this issue, we propose a trait as a contract(between 
> ExpressionEncoder and any other user-defined Encoder) to enable case 
> class/tuple/java bean's ExpressionEncoder to discover the 
> serializer/deserializer/schema from the Encoder of the user-defined type.
> With this proposed patch and our minor modification in AvroEncoder, we remove 
> these limitations with cluster-default conf 
> spark.expressionencoder.org.apache.avro.specific.SpecificRecord = 
> com.databricks.spark.avro.AvroEncoder$
> This is a patch we have implemented internally and has been used for a few 
> quarters. We want to propose to upstream as we think this is a useful feature 
> to make Dataset more flexible to user types.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to