[
https://issues.apache.org/jira/browse/SPARK-28008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated SPARK-28008:
----------------------------------
Affects Version/s: 3.0.0 (was: 2.4.3)
> Default values & column comments in AVRO schema converters
> ----------------------------------------------------------
>
> Key: SPARK-28008
> URL: https://issues.apache.org/jira/browse/SPARK-28008
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Mathew Wicks
> Priority: Major
>
> Currently, both `toAvroType` and `toSqlType` in
> [SchemaConverters.scala#L134|https://github.com/apache/spark/blob/branch-2.4/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala#L134]
> behave unexpectedly in two ways.
> h2. Nullable fields in Spark are converted to UNION[TYPE, NULL] and no default value is set:
> *Current Behaviour:*
> {code:java}
> import org.apache.spark.sql.avro.SchemaConverters
> import org.apache.spark.sql.types._
> val schema = new StructType().add("a", "string", nullable = true)
> val avroSchema = SchemaConverters.toAvroType(schema)
> println(avroSchema.toString(true))
> // Output:
> {
> "type" : "record",
> "name" : "topLevelRecord",
> "fields" : [ {
> "name" : "a",
> "type" : [ "string", "null" ]
> } ]
> }
> {code}
> *Expected Behaviour:*
> (NOTE: "null" and "string" are reversed in the union; this is needed for a
> default value of null.)
> {code:java}
> import org.apache.spark.sql.avro.SchemaConverters
> import org.apache.spark.sql.types._
> val schema = new StructType().add("a", "string", nullable = true)
> val avroSchema = SchemaConverters.toAvroType(schema)
> println(avroSchema.toString(true))
> // Output:
> {
> "type" : "record",
> "name" : "topLevelRecord",
> "fields" : [ {
> "name" : "a",
> "type" : [ "null", "string" ],
> "default" : null
> } ]
> }{code}
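
For reference, the expected shape can already be built by hand with Avro's own `SchemaBuilder`, whose `optionalString` shortcut creates a union of null and string with a null default. This is only a sketch of the target schema, not a change to `SchemaConverters`; the name `expected` is illustrative.

```scala
import org.apache.avro.{Schema, SchemaBuilder}

// Build the target schema directly: optionalString emits a union with
// "null" as the first branch and sets the field default to null.
val expected: Schema = SchemaBuilder
  .record("topLevelRecord")
  .fields()
  .optionalString("a")
  .endRecord()

println(expected.toString(true))
```

Comparing this schema with the result of `toAvroType` is one way to check whether the converter produces the proposed layout.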
> h2. Field comments/metadata are not propagated:
> *Current Behaviour:*
> {code:java}
> import org.apache.spark.sql.avro.SchemaConverters
> import org.apache.spark.sql.types._
> val schema = new StructType().add("a", "string", nullable=false,
> comment="AAAAAAA")
> val avroSchema = SchemaConverters.toAvroType(schema)
> println(avroSchema.toString(true))
> // Output:
> {
> "type" : "record",
> "name" : "topLevelRecord",
> "fields" : [ {
> "name" : "a",
> "type" : "string"
> } ]
> }{code}
> *Expected Behaviour:*
> {code:java}
> import org.apache.spark.sql.avro.SchemaConverters
> import org.apache.spark.sql.types._
> val schema = new StructType().add("a", "string", nullable=false,
> comment="AAAAAAA")
> val avroSchema = SchemaConverters.toAvroType(schema)
> println(avroSchema.toString(true))
> // Output:
> {
> "type" : "record",
> "name" : "topLevelRecord",
> "fields" : [ {
> "name" : "a",
> "type" : "string",
> "doc" : "AAAAAAA"
> } ]
> }{code}
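
A minimal sketch of how the comment could be carried over, assuming the Spark comment maps to Avro's `doc` attribute as proposed above. It builds the record by hand with Avro's `SchemaBuilder` and handles only a single non-nullable string field; the names `named`, `documented`, and `withDoc` are illustrative and not part of `SchemaConverters`.

```scala
import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.spark.sql.types._

val schema = new StructType().add("a", "string", nullable = false, comment = "AAAAAAA")
val field: StructField = schema("a")

// Start the Avro field and attach "doc" only when the Spark field has a comment.
val named = SchemaBuilder.record("topLevelRecord").fields().name(field.name)
val documented = field.getComment().fold(named)(c => named.doc(c))

// Assumption: the field is a non-nullable string, as in the example above.
val withDoc: Schema = documented.`type`().stringType().noDefault().endRecord()
println(withDoc.toString(true))
```

A real fix inside the converter would need to do the same for every field type, including nested records.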
>
> `toSqlType` should behave similarly, but in the reverse direction.
> I think we should aim to get this in before 3.0, as it will probably be a
> breaking change for some users of the Avro API.
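
For the reverse direction, a rough sketch of what propagating `doc` into Spark comments might look like, written as a wrapper around the existing `toSqlType`. The helper name `toSqlTypeWithComments` is hypothetical, and the sketch assumes the Avro schema is a record and only covers its top-level fields.

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

// Hypothetical helper: convert with the existing converter, then copy each
// Avro field's "doc" (null when absent) onto the matching Spark field.
def toSqlTypeWithComments(avro: Schema): StructType = {
  val converted = SchemaConverters.toSqlType(avro).dataType.asInstanceOf[StructType]
  StructType(converted.fields.map { f =>
    Option(avro.getField(f.name).doc()).fold(f)(f.withComment)
  })
}
```

Fields without a `doc` attribute are returned unchanged, so the wrapper does not alter schemas that carry no documentation.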