Hi Cheng,

Thanks a lot for your help!
A quick question: do we need to regenerate the data to reflect this change?

Thanks,
Yan

On Sat, Sep 19, 2015 at 12:55 AM, Cheng Lian <[email protected]> wrote:

> To be more specific, the Parquet schema converted from the Avro schema is:
>
>     message avro.parquet.model.Profile {
>       required group nest2 (LIST) {
>         repeated group array {
>           required group nest1 {
>             required int32 mpi;
>             required int32 ts;
>           }
>         }
>       }
>     }
>
> The second-level "array" field should be viewed as the element type
> according to the following LIST backwards-compatibility rule:
>
>     If the repeated field is a group with one field and is named either
>     "array" or uses the LIST-annotated group's name with "_tuple"
>     appended, then the repeated type is the element type and elements
>     are required.
>
> However, this rule is only handled in AvroSchemaConverter but not in
> AvroIndexedRecordConverter. Thus, "array" is interpreted as the element
> wrapper group, and the element type of nest2 is considered to be
>
>     required group nest1 {
>       required int32 mpi;
>       required int32 ts;
>     }
>
> rather than
>
>     required group Nest2 {
>       required group nest1 {
>         required int32 mpi;
>         required int32 ts;
>       }
>     }
>
> This PR should probably fix your issue (although I still have some
> questions about the testing code):
> https://github.com/apache/parquet-mr/pull/264
>
> BTW, the recently released Spark SQL 1.5 should be able to read such
> Parquet files, but you need some extra work if you want to convert Spark
> SQL row objects to Avro records. This PR fixed the issue for Spark SQL:
> https://github.com/apache/spark/pull/8361
>
> Cheng
>
> On 9/19/15 12:40 AM, Cheng Lian wrote:
>
>> You are hitting PARQUET-364:
>> https://issues.apache.org/jira/browse/PARQUET-364
>>
>> Although the symptoms are different, they share the same root cause.
>>
>> Cheng
>>
>> On 9/18/15 6:13 PM, Yan Qi wrote:
>>
>>> Hi,
>>>
>>> The data I am working on has the following schema:
>>>
>>>     {
>>>       "namespace": "avro.parquet.model",
>>>       "type": "record",
>>>       "name": "Profile",
>>>       "fields": [
>>>         {"name": "nest1", "type": [{"type": "array", "items": "Nest1"}, "null"]},
>>>         {"name": "nest2", "type": [{"type": "array", "items": "Nest2"}, "null"]},
>>>         {"name": "idd", "type": "long"}
>>>       ]
>>>     }
>>>
>>>     {
>>>       "namespace": "avro.parquet.model",
>>>       "type": "record",
>>>       "name": "Nest1",
>>>       "fields": [
>>>         {"name": "ts", "type": "int"},
>>>         {"name": "mpi", "type": "int"},
>>>         {"name": "api1", "type": ["int", "null"], "default": -1},
>>>         {"name": "api2", "type": ["long", "null"], "default": -1},
>>>         {"name": "api3", "type": ["int", "null"], "default": -1}
>>>       ]
>>>     }
>>>
>>>     {
>>>       "namespace": "avro.parquet.model",
>>>       "type": "record",
>>>       "name": "Nest2",
>>>       "fields": [
>>>         {"name": "nest1", "type": "Nest1"},
>>>         {"name": "ts", "type": "int"}
>>>       ]
>>>     }
>>>
>>> Basically there are two nested tables, nest1 and nest2, in the table
>>> Profile, and Nest2 has a member of type Nest1. However, when I try to
>>> access the data with the following projection schema:
>>>
>>>     {
>>>       "type": "record",
>>>       "name": "Profile",
>>>       "namespace": "avro.parquet.model",
>>>       "fields": [
>>>         {
>>>           "name": "nest2",
>>>           "type": [
>>>             {
>>>               "type": "array",
>>>               "items": {
>>>                 "type": "record",
>>>                 "name": "Nest2",
>>>                 "fields": [
>>>                   {
>>>                     "name": "nest1",
>>>                     "type": {
>>>                       "type": "record",
>>>                       "name": "Nest1",
>>>                       "fields": [
>>>                         {"name": "mpi", "type": "int"},
>>>                         {"name": "ts", "type": "int"}
>>>                       ]
>>>                     }
>>>                   }
>>>                 ]
>>>               }
>>>             }
>>>           ]
>>>         }
>>>       ]
>>>     }
>>>
>>> I get the error message:
>>>
>>>     org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch.
>>>     Avro field 'mpi' not found.
>>>         at org.apache.parquet.avro.AvroIndexedRecordConverter.getAvroField(AvroIndexedRecordConverter.java:133)
>>>         ....
>>>
>>> When checking the code, I found that the function
>>> org.apache.parquet.avro.AvroIndexedRecordConverter.AvroArrayConverter.isElementType(Type,
>>> Schema) returns false, even though these two parameters (i.e., type and
>>> schema) are compatible. I am wondering if this is a bug, or if I missed
>>> something here.
>>>
>>> Thanks,
>>> Yan
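For anyone following along, the LIST backwards-compatibility rule Cheng quotes can be sketched roughly as below. This is a simplified standalone illustration, not the actual parquet-mr code: the class and method names here are made up, and the real logic (in AvroSchemaConverter, and in AvroIndexedRecordConverter after the linked PR) operates on Parquet Type and Avro Schema objects rather than plain strings.

```java
// Hypothetical sketch of the element-type detection rule for legacy
// 2-level LIST structures, working on plain strings for illustration.
public class ListCompatRule {

    /**
     * Decides whether the repeated group inside a LIST-annotated group is
     * itself the element type (true), or just a wrapper around the real
     * element (false).
     *
     * @param repeatedName name of the repeated group
     * @param listName     name of the enclosing LIST-annotated group
     * @param fieldCount   number of fields inside the repeated group
     */
    static boolean repeatedGroupIsElementType(String repeatedName,
                                              String listName,
                                              int fieldCount) {
        if (fieldCount != 1) {
            // A repeated group with more than one field cannot be a
            // single-element wrapper, so it must be the element itself.
            return true;
        }
        // One field: it is the element type only if it is named "array"
        // or "<listName>_tuple", per the backwards-compatibility rule.
        return repeatedName.equals("array")
            || repeatedName.equals(listName + "_tuple");
    }

    public static void main(String[] args) {
        // The schema in this thread: a repeated group named "array" with a
        // single field "nest1". Under the rule, the repeated group is the
        // element type, so the element is the full Nest2 record, not the
        // inner nest1 group.
        System.out.println(repeatedGroupIsElementType("array", "nest2", 1)); // true
        // A single-field repeated group with some other name is treated as
        // a wrapper, and its lone field is the element type.
        System.out.println(repeatedGroupIsElementType("list", "nest2", 1)); // false
    }
}
```

The reported bug is exactly the gap between these two branches: the write-side converter applied the "array" special case, while the read-side record converter took the wrapper branch and so looked for the field 'mpi' one nesting level too deep.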
