[jira] [Updated] (PARQUET-1681) Avro's isElementType() change breaks the reading of some parquet(1.8.1) files

Xinli Shang (Jira) Fri, 18 Oct 2019 15:04:57 -0700


     [ 
https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Xinli Shang updated PARQUET-1681:
---------------------------------
    Description: 
When a Parquet 1.8.1 file is written with the Avro schema below and then read 
back with Parquet 1.10.1 without passing any schema, the read throws the 
exception "XXX is not a group". Reading the same file with Parquet 1.8.1 works. 

    {
      "name": "phones",
      "type": [
        "null",
        {
          "type": "array",
          "items": {
            "type": "record",
            "name": "phones_items",
            "fields": [
              {
                "name": "phone_number",
                "type": [
                  "null",
                  "string"
                ],
                "default": null
              }
            ]
          }
        }
      ],
      "default": null
    }

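The code used to read the file is:

    import org.apache.hadoop.conf.Configuration
    import org.apache.parquet.avro.AvroParquetReader

    // No Avro read schema is supplied; only a default Hadoop Configuration.
    val reader = AvroParquetReader.builder[SomeRecordType](parquetPath)
      .withConf(new Configuration)
      .build()

    reader.read()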

PARQUET-651 changed isElementType() to rely on Avro's 
checkReaderWriterCompatibility() for the compatibility check. However, 
checkReaderWriterCompatibility() considers the Parquet schema and the Avro 
schema (converted from the file schema) incompatible, because the record is 
named 'phones_items' in the Avro schema but 'array' in the Parquet schema. 
isElementType() therefore returns false, which causes the "phone_number" field 
in the schema above to be treated as a group type, which it is not. The 
exception is then thrown at .asGroupType(). 
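
The name mismatch described above can be looked at in isolation with a small 
sketch. It builds two Avro record schemas that are identical except for the 
record name and passes them to Avro's 
SchemaCompatibility.checkReaderWriterCompatibility(). The object name 
NameMismatchSketch, the helper phoneRecord, and the choice of which schema is 
the reader and which is the writer are assumptions made for illustration. This 
is not the parquet-avro code path itself. It assumes an Avro jar on the 
classpath, and the reported result may depend on the Avro version.

```scala
import org.apache.avro.{Schema, SchemaBuilder, SchemaCompatibility}

object NameMismatchSketch {
  // Hypothetical helper: a record with one optional string field
  // (union of null and string, default null), varying only the record name.
  def phoneRecord(name: String): Schema =
    SchemaBuilder.record(name).fields()
      .optionalString("phone_number")
      .endRecord()

  def main(args: Array[String]): Unit = {
    // Name used in the Avro schema from the issue description.
    val avroSide = phoneRecord("phones_items")
    // Name the issue reports on the Parquet side of the comparison.
    val parquetSide = phoneRecord("array")

    // Assumption: the Avro schema is treated as the reader schema here.
    val result =
      SchemaCompatibility.checkReaderWriterCompatibility(avroSide, parquetSide)

    // Print the compatibility verdict and Avro's explanation for inspection.
    println(result.getType)
    println(result.getDescription)
  }
}
```

If the verdict is anything other than COMPATIBLE even though the fields are 
identical, that matches the behaviour described above, where the record name 
alone decides the outcome.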

I have not checked whether writing with Parquet 1.10.1 reproduces the same 
problem. It might, because the translation from Avro schema to Parquet schema 
does not appear to have changed (not verified yet). 

 I hesitate to revert PARQUET-651 because it solved several problems. I would 
like to hear the community's thoughts on it. 

  was:
When using the Avro schema below to write a parquet(1.8.1) file and then read 
back by using parquet 1.10.1 without passing any schema, the reading throws an 
exception "XXX is not a group" . Reading throw parquet 1.8.1 is fine. 

    {
      "name": "phones",
      "type": [
        "null",
        {
          "type": "array",
          "items": {
            "type": "record",
            "name": "phones_items",
            "fields": [
              {
                "name": "phone_number",
                "type": [
                  "null",
                  "string"
                ],
                "default": null
              }
            ]
          }
        }
      ],
      "default": null
    }

The code to read is as below 

    val reader = AvroParquetReader.builder[SomeRecordType](parquetPath)
      .withConf(new Configuration)
      .build()

    reader.read()

PARQUET-651 changed the method isElementType() by relying on Avro's 
checkReaderWriterCompatibility() to check the compatibility. However, 
checkReaderWriterCompatibility() consider the ParquetSchema and the 
AvroSchema(converted from File schema) as not compatible(the name in avro 
schema is ‘phones_items’, but the name is ‘array’ in Parquet schema, hence not 
compatible) . Hence return false and caused the “phone_number” field in the 
above schema to be considered as group type which is not true. Then the 
exception throws as .asGroupType(). 

I didn’t try writing via parquet 1.10.1 would reproduce the same problem or 
not. But it could because the translation of Avro schema to Parquet schema is 
not changed(didn’t verify yet). 

 I hesitate to revert PARQUET-651 because it solved several problems. I would 
like to hear the community's thoughts on it. 


> Avro's isElementType() change breaks the reading of some parquet(1.8.1) files
> -----------------------------------------------------------------------------
>
>                 Key: PARQUET-1681
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1681
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-avro
>    Affects Versions: 1.10.0, 1.9.1, 1.11.0
>            Reporter: Xinli Shang
>            Priority: Critical
>             Fix For: 1.11.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
