[
https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Fangshi Li updated SPARK-24256:
-------------------------------
Description:
Right now, ExpressionEncoder supports ser/de of primitive types, as well as
scala case class, tuple and java bean class. Spark's Dataset natively supports
these mentioned types, but we find Dataset is not flexible for other
user-defined types and encoders.
For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset.
Although we can use AvroEncoder to define Dataset with types being the Avro
Generic or Specific Record, using such Avro typed Dataset has many limitations:
# We can not use joinWith on this Dataset since the result is a tuple, but
Avro types cannot be the field of this tuple.
# We can not use some type-safe aggregation methods on this Dataset, such as
KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
# We cannot augment an Avro SpecificRecord with additional primitive fields
together in a case class, which we find is a very common use case.
The limitation is that ExpressionEncoder does not support serde of Scala case
class/tuple with subfields being any other user-defined type with its own
Encoder for serde.
To address this issue, we propose a trait as a contract(between
ExpressionEncoder and any other user-defined Encoder) to enable case
class/tuple/java bean to support user-defined types.
With this proposed patch and our minor modification in AvroEncoder, we remove
above-mentioned limitations with cluster-default conf
spark.expressionencoder.org.apache.avro.specific.SpecificRecord =
com.databricks.spark.avro.AvroEncoder$
This is a patch we have implemented internally and has been used for a few
quarters. We want to propose to upstream as we think this is a useful feature
to make Dataset more flexible to user types.
was:
Right now, ExpressionEncoder supports ser/de of primitive types, as well as
scala case class, tuple and java bean class. Spark's Dataset natively supports
these mentioned types, but we find Dataset is not flexible for other
user-defined types and encoders.
For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset.
Although we can use AvroEncoder to define Dataset with types being the Avro
Generic or Specific Record, using such Avro typed Dataset has many limitations:
1. We can not use joinWith on this Dataset since the result is a tuple, but
Avro types cannot be the field of this tuple.
2. We can not use some type-safe aggregation methods on this Dataset, such as
KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
3. We cannot augment an Avro SpecificRecord with additional primitive fields
together in a case class, which we find is a very common use case.
The limitation that Spark does not support define a Scala case class/tuple with
subfields being any other user-defined type, is because ExpressionEncoder does
not discover the implicit Encoder for the user-defined field types, thus can
not use any Encoder to serde the user-defined fields in case class/tuple.
To address this issue, we propose a trait as a contract(between
ExpressionEncoder and any other user-defined Encoder) to enable case
class/tuple/java bean's ExpressionEncoder to discover the
serializer/deserializer/schema from the Encoder of the user-defined type.
With this proposed patch and our minor modification in AvroEncoder, we remove
these limitations with cluster-default conf
spark.expressionencoder.org.apache.avro.specific.SpecificRecord =
com.databricks.spark.avro.AvroEncoder$
This is a patch we have implemented internally and has been used for a few
quarters. We want to propose to upstream as we think this is a useful feature
to make Dataset more flexible to user types.
> ExpressionEncoder should support user-defined types as fields of Scala case
> class and tuple
> -------------------------------------------------------------------------------------------
>
> Key: SPARK-24256
> URL: https://issues.apache.org/jira/browse/SPARK-24256
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Fangshi Li
> Priority: Major
>
> Right now, ExpressionEncoder supports ser/de of primitive types, as well as
> scala case class, tuple and java bean class. Spark's Dataset natively
> supports these mentioned types, but we find Dataset is not flexible for other
> user-defined types and encoders.
> For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset.
> Although we can use AvroEncoder to define Dataset with types being the Avro
> Generic or Specific Record, using such Avro typed Dataset has many
> limitations:
> # We can not use joinWith on this Dataset since the result is a tuple, but
> Avro types cannot be the field of this tuple.
> # We can not use some type-safe aggregation methods on this Dataset, such as
> KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
> # We cannot augment an Avro SpecificRecord with additional primitive fields
> together in a case class, which we find is a very common use case.
> The limitation is that ExpressionEncoder does not support serde of Scala case
> class/tuple with subfields being any other user-defined type with its own
> Encoder for serde.
> To address this issue, we propose a trait as a contract(between
> ExpressionEncoder and any other user-defined Encoder) to enable case
> class/tuple/java bean to support user-defined types.
> With this proposed patch and our minor modification in AvroEncoder, we remove
> above-mentioned limitations with cluster-default conf
> spark.expressionencoder.org.apache.avro.specific.SpecificRecord =
> com.databricks.spark.avro.AvroEncoder$
> This is a patch we have implemented internally and has been used for a few
> quarters. We want to propose to upstream as we think this is a useful feature
> to make Dataset more flexible to user types.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]