[ https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Fangshi Li updated SPARK-24256:
-------------------------------
Description: 
Right now, ExpressionEncoder supports ser/de of primitive types, as well as Scala case classes, tuples, and Java bean classes. Spark's Dataset natively supports these types, but we find that Dataset is not flexible enough for other user-defined types and encoders.

For example, spark-avro has an AvroEncoder for ser/de of Avro types in Dataset. Although we can use AvroEncoder to define a Dataset whose type is an Avro Generic or Specific Record, such an Avro-typed Dataset has many limitations:
# We cannot use joinWith on this Dataset, since the result is a tuple, but Avro types cannot be fields of that tuple.
# We cannot use some type-safe aggregation methods on this Dataset, such as KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
# We cannot augment an Avro SpecificRecord with additional primitive fields in a case class, which we find is a very common use case.

The underlying limitation is that ExpressionEncoder does not support serde of a Scala case class/tuple whose subfields are of any other user-defined type with its own Encoder.

To address this issue, we propose a trait as a contract (between ExpressionEncoder and any other user-defined Encoder) that enables case classes/tuples/Java beans to support user-defined types; a minimal sketch of such a contract appears below.

With this proposed patch and our minor modification in AvroEncoder, we remove the above-mentioned limitations with the cluster-default conf:
spark.expressionencoder.org.apache.avro.specific.SpecificRecord = com.databricks.spark.avro.AvroEncoder$

This is a patch we have implemented internally and have been using for a few quarters. We want to propose it upstream, as we think it is a useful feature that makes Dataset more flexible with respect to user types.

was:
Right now, ExpressionEncoder supports ser/de of primitive types, as well as Scala case classes, tuples, and Java bean classes. Spark's Dataset natively supports these types, but we find that Dataset is not flexible enough for other user-defined types and encoders.

For example, spark-avro has an AvroEncoder for ser/de of Avro types in Dataset. Although we can use AvroEncoder to define a Dataset whose type is an Avro Generic or Specific Record, such an Avro-typed Dataset has many limitations:
1. We cannot use joinWith on this Dataset, since the result is a tuple, but Avro types cannot be fields of that tuple.
2. We cannot use some type-safe aggregation methods on this Dataset, such as KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
3. We cannot augment an Avro SpecificRecord with additional primitive fields in a case class, which we find is a very common use case.

Spark cannot support defining a Scala case class/tuple with subfields of any other user-defined type because ExpressionEncoder does not discover the implicit Encoder for the user-defined field types, and thus cannot use any Encoder to serde those fields in the case class/tuple.

To address this issue, we propose a trait as a contract (between ExpressionEncoder and any other user-defined Encoder) that enables a case class/tuple/Java bean's ExpressionEncoder to discover the serializer/deserializer/schema from the Encoder of the user-defined type.

With this proposed patch and our minor modification in AvroEncoder, we remove these limitations with the cluster-default conf:
spark.expressionencoder.org.apache.avro.specific.SpecificRecord = com.databricks.spark.avro.AvroEncoder$

This is a patch we have implemented internally and have been using for a few quarters. We want to propose it upstream, as we think it is a useful feature that makes Dataset more flexible with respect to user types.
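To make the proposed contract concrete, here is a minimal sketch of what such a trait could look like, assuming it exposes the serializer/deserializer/schema that ExpressionEncoder needs to discover. The trait name, method signatures, and use of Catalyst's Expression type are illustrative, not the actual patch:

{code:scala}
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.types.DataType

// Hypothetical contract implemented by a user-defined Encoder (e.g. AvroEncoder)
// so that ExpressionEncoder can embed the user type as a field of a case
// class/tuple/Java bean. All names here are illustrative.
trait UserDefinedEncoderContract[T] {
  // Catalyst expression that serializes an instance of T into Spark's
  // internal row representation.
  def serializer: Expression

  // Catalyst expression that deserializes the internal representation
  // back into an instance of T.
  def deserializer: Expression

  // Catalyst schema of the serialized form of T.
  def schema: DataType
}
{code}

Under this contract, ExpressionEncoder would delegate to the user-defined Encoder for any field whose type it cannot handle natively, instead of failing to derive ser/de expressions for it.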

> ExpressionEncoder should support user-defined types as fields of Scala case class and tuple
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24256
>                 URL: https://issues.apache.org/jira/browse/SPARK-24256
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Fangshi Li
>            Priority: Major
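For illustration, a minimal sketch of wiring up the cluster-default conf from the description. The key/value pair is taken verbatim from the description; setting it through SparkSession.builder (rather than cluster-wide in spark-defaults.conf) is an assumption:

{code:scala}
import org.apache.spark.sql.SparkSession

object AvroEncoderConfExample {
  def main(args: Array[String]): Unit = {
    // The conf key encodes the user type's class name; the value names the
    // Encoder object that handles that type (both taken from the description).
    val spark = SparkSession.builder()
      .appName("avro-encoder-example")
      .config("spark.expressionencoder.org.apache.avro.specific.SpecificRecord",
              "com.databricks.spark.avro.AvroEncoder$")
      .getOrCreate()

    // With the patch applied, a case class mixing an Avro SpecificRecord with
    // primitive fields (limitation 3 above) would get a working ExpressionEncoder:
    //   case class Enriched(record: SomeSpecificRecord, score: Double)
    // where SomeSpecificRecord stands in for a hypothetical Avro-generated class.

    spark.stop()
  }
}
{code}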