[ 
https://issues.apache.org/jira/browse/SPARK-28008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862699#comment-16862699
 ] 

Mathew Wicks commented on SPARK-28008:
--------------------------------------

The only issue I can think of is that the column comments aren't saved 
(which some users might want).

 

While I agree it doesn't seem like the API should be public, it is useful to 
know what schema a DataFrame will be written with (some Spark types have to be 
converted for Avro). Also, the user might want to make changes and then use the 
"avroSchema" writer option, for example, writing timestamps with the 
"timestamp-millis" logical type rather than "timestamp-micros".

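To illustrate that workflow, here is a rough sketch for spark-shell. It assumes 
Spark 2.4 with the spark-avro module on the classpath, where TimestampType is 
converted to the "timestamp-micros" logical type. The column name "ts" and the 
output path are made up for the example, and the string replacement is only a 
quick way to edit the generated schema, not a recommended general approach.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.functions.current_timestamp

// In spark-shell this returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("avro-schema-sketch")
  .getOrCreate()

// Hypothetical one-column DataFrame holding a timestamp.
val df = spark.range(1).select(current_timestamp().as("ts"))

// Inspect the Avro schema Spark would derive from the DataFrame's StructType.
val generated = SchemaConverters.toAvroType(df.schema)
println(generated.toString(true))

// Edit the generated schema: switch the logical type from micros to millis.
val edited = generated.toString(false)
  .replace("timestamp-micros", "timestamp-millis")

// Write using the edited schema via the "avroSchema" writer option.
// The output path is an assumption made for this example.
df.write
  .format("avro")
  .option("avroSchema", edited)
  .mode("overwrite")
  .save("/tmp/ts_millis_avro")
```

Without access to toAvroType, the user would have to write the whole Avro 
schema by hand just to change one logical type.
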
 

Beyond that, is there really any harm in having a more correct conversion from 
a StructType into an Avro schema?

> Default values & column comments in AVRO schema converters
> ----------------------------------------------------------
>
>                 Key: SPARK-28008
>                 URL: https://issues.apache.org/jira/browse/SPARK-28008
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.3
>            Reporter: Mathew Wicks
>            Priority: Major
>
> Currently in both `toAvroType` and `toSqlType` 
> [SchemaConverters.scala#L134|https://github.com/apache/spark/blob/branch-2.4/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala#L134]
>  there are two behaviours which are unexpected.
> h2. Nullable fields in spark are converted to UNION[TYPE, NULL] and no 
> default value is set:
> *Current Behaviour:*
> {code:java}
> import org.apache.spark.sql.avro.SchemaConverters
> import org.apache.spark.sql.types._
> val schema = new StructType().add("a", "string", nullable = true)
> val avroSchema = SchemaConverters.toAvroType(schema)
> println(avroSchema.toString(true))
> {
>   "type" : "record",
>   "name" : "topLevelRecord",
>   "fields" : [ {
>     "name" : "a",
>     "type" : [ "string", "null" ]
>   } ]
> }
> {code}
> *Expected Behaviour:*
> (NOTE: The reversal of "null" & "string" in the union, needed for a default 
> value of null)
> {code:java}
> import org.apache.spark.sql.avro.SchemaConverters
> import org.apache.spark.sql.types._
> val schema = new StructType().add("a", "string", nullable = true)
> val avroSchema = SchemaConverters.toAvroType(schema)
> println(avroSchema.toString(true))
> {
>   "type" : "record",
>   "name" : "topLevelRecord",
>   "fields" : [ {
>     "name" : "a",
>     "type" : [ "null", "string" ],
>     "default" : null
>   } ]
> }{code}
> h2. Field comments/metadata is not propagated:
> *Current Behaviour:*
> {code:java}
> import org.apache.spark.sql.avro.SchemaConverters
> import org.apache.spark.sql.types._
> val schema = new StructType().add("a", "string", nullable=false, 
> comment="AAAAAAA")
> val avroSchema = SchemaConverters.toAvroType(schema)
> println(avroSchema.toString(true))
> {
>   "type" : "record",
>   "name" : "topLevelRecord",
>   "fields" : [ {
>     "name" : "a",
>     "type" : "string"
>   } ]
> }{code}
> *Expected Behaviour:*
> {code:java}
> import org.apache.spark.sql.avro.SchemaConverters
> import org.apache.spark.sql.types._
> val schema = new StructType().add("a", "string", nullable=false, 
> comment="AAAAAAA")
> val avroSchema = SchemaConverters.toAvroType(schema)
> println(avroSchema.toString(true))
> {
>   "type" : "record",
>   "name" : "topLevelRecord",
>   "fields" : [ {
>     "name" : "a",
>     "type" : "string",
>     "doc" : "AAAAAAA"
>   } ]
> }{code}
>  
> The behaviour should be similar (but the reverse) for `toSqlType`.
> I think we should aim to get this in before 3.0, as it will probably be a 
> breaking change for some usage of the AVRO API.


