[ 
https://issues.apache.org/jira/browse/SPARK-38245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494294#comment-17494294
 ] 

Erik Krogen commented on SPARK-38245:
-------------------------------------

This behavior is expected. The fields of the union are expanded based on their 
position.
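
As a rough illustration of what "based on their position" means, here is a minimal Python sketch. It is not Spark's actual implementation: the helper name positional_field_names is hypothetical, the member<i> naming is taken from the output shown in the report, and special cases such as null branches are ignored.

```python
def positional_field_names(union_branches):
    """Name each union branch by its index, ignoring the branch's type name."""
    return [f"member{i}" for i, _ in enumerate(union_branches)]


# Branch names taken from the `values` union in the reported schema.
names = positional_field_names(["RecordOne", "RecordTwo"])
print(names)
```

The record names play no part in the result; only the order of the branches in the union does.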

 

I guess you're proposing that the name of the type be used as the name of the 
field, instead of a positional name? This will get pretty confusing for unions 
of primitives, e.g. the following type:
{code:java}
{
  "name": "foo",
  "type": ["int", "long"]
} {code}
would, under that proposal, have the Spark type:
{code:java}
root
 |-- foo: struct
 |    |-- int: int
 |    |-- long: long {code}
Using type names as field names looks very confusing, at least to me.

Another problem: you could end up with duplicate field names, for example:
{code:java}
{
  "name": "foo",
  "type": [{
      "type": "record",
      "name": "RecordOne",
      "namespace": "foo"
    }, {
      "type": "record",
      "name": "RecordOne",
      "namespace": "bar"
    }
  ]
} {code}
Because the namespaces differ, these are different types, and this is a valid 
union. Under your proposal, however, both would produce the same field name. 
We could include the namespace in the field name as well, but that would get 
messy quickly.
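
To make the collision concrete, here is a small Python sketch under stated assumptions. Both helper functions are hypothetical: name_based_field_names models the proposal (use each record's simple name), and positional_field_names models the positional naming described above. Neither is Spark code.

```python
from collections import Counter


def name_based_field_names(union_branches):
    """Hypothetical proposal: use each record's simple name as the field name."""
    return [branch["name"] for branch in union_branches]


def positional_field_names(union_branches):
    """Positional naming as described above: name fields by index."""
    return [f"member{i}" for i in range(len(union_branches))]


# The two records from the example: same simple name, different namespaces.
branches = [
    {"type": "record", "name": "RecordOne", "namespace": "foo"},
    {"type": "record", "name": "RecordOne", "namespace": "bar"},
]

by_name = name_based_field_names(branches)
# Any name that appears more than once would be a duplicate struct field.
duplicates = [name for name, count in Counter(by_name).items() if count > 1]

print(by_name)
print(duplicates)
print(positional_field_names(branches))
```

Name-based fields collide on RecordOne here, while positional names stay unique regardless of what the branches are called.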

> Avro Complex Union Type return `member$I`
> -----------------------------------------
>
>                 Key: SPARK-38245
>                 URL: https://issues.apache.org/jira/browse/SPARK-38245
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.2.1
>         Environment: +OS+
>  * Debian GNU/Linux 10 (Docker Container)
> +packages & others+
>  * spark-avro_2.12-3.2.1
>  * python 3.7.3
>  * pyspark 3.2.1
>  * spark-3.2.1-bin-hadoop3.2
>  * Docker version 20.10.12
>            Reporter: Teddy Crepineau
>            Priority: Major
>              Labels: avro, newbie
>
> *Short Description*
> When reading complex union types from Avro files, information appears to be 
> lost: the name of the record is omitted and {{member$i}} is returned 
> instead.
> *Long Description*
> +Error+
> Given the Avro schema {{schema.avsc}}, I would expect the schema when 
> reading the Avro file using {{read_avro.py}} to be as in {{expected.txt}}. 
> Instead, I get the schema output in {{reality.txt}}, where {{RecordOne}} 
> became {{member0}}, etc.
> This causes information loss and makes the DataFrame unusable.
> From my understanding this behavior was implemented 
> [here.|https://github.com/databricks/spark-avro/pull/117]
>  
> {code:java|title=read_avro.py}
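> from pyspark.sql import SparkSession
> spark = SparkSession.builder.getOrCreate()  # assumes the spark-avro package is available
> df = spark.read.format("avro").load("path/to/my/file.avro")
> df.printSchema()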
>  {code}
> {code:java|title=schema.avsc}
>  {
>  "type": "record",
>  "name": "SomeData",
>  "namespace": "my.name.space",
>  "fields": [
>   {
>    "name": "ts",
>    "type": {
>     "type": "long",
>     "logicalType": "timestamp-millis"
>    }
>   },
>   {
>    "name": "field_id",
>    "type": [
>     "null",
>     "string"
>    ],
>    "default": null
>   },
>   {
>    "name": "values",
>    "type": [
>     {
>      "type": "record",
>      "name": "RecordOne",
>      "fields": [
>       {
>        "name": "field_a",
>        "type": "long"
>       },
>       {
>        "name": "field_b",
>        "type": {
>         "type": "enum",
>         "name": "FieldB",
>         "symbols": [
>             "..."
>         ]
>        }
>       },
>       {
>        "name": "field_c",
>        "type": {
>         "type": "array",
>         "items": "long"
>        }
>       }
>      ]
>     },
>     {
>      "type": "record",
>      "name": "RecordTwo",
>      "fields": [
>       {
>        "name": "field_a",
>        "type": "long"
>       }
>      ]
>     }
>    ]
>   }
>  ]
> }{code}
> {code:java|title=expected.txt}
> root
>  |-- ts: timestamp (nullable = true)
>  |-- field_id: string (nullable = true)
>  |-- values: struct (nullable = true)
>  |    |-- RecordOne: struct (nullable = true)
>  |    |    |-- field_a: long (nullable = true)
>  |    |    |-- field_b: string (nullable = true)
>  |    |    |-- field_c: array (nullable = true)
>  |    |    |    |-- element: long (containsNull = true)
>  |    |-- RecordTwo: struct (nullable = true)
>  |    |    |-- field_a: long (nullable = true)
> {code}
> {code:java|title=reality.txt}
> root
>  |-- ts: timestamp (nullable = true)
>  |-- field_id: string (nullable = true)
>  |-- values: struct (nullable = true)
>  |    |-- member0: struct (nullable = true)
>  |    |    |-- field_a: long (nullable = true)
>  |    |    |-- field_b: string (nullable = true)
>  |    |    |-- field_c: array (nullable = true)
>  |    |    |    |-- element: long (containsNull = true)
>  |    |-- member1: struct (nullable = true)
>  |    |    |-- field_a: long (nullable = true)
> {code}


