Here are my tests (in Scala, with parquet-mr 1.7.0):
- Case 1 <https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/test/scala/com/databricks/parquet/schema/ListCompatibilitySuite.scala#L103-L135>: reading and writing a raw Parquet file containing an Avro array of single-element records.
- Case 2 <https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/test/scala/com/databricks/parquet/schema/ListCompatibilitySuite.scala#L137-L155>: the Profile example you provided. The Avro definitions can be found here: <https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/main/avro/parquet-avro-compat.avdl#L24-L35>.
Is it possible that the Profile example schema you provided doesn't fully demonstrate the nature of your problematic data?
Cheng

On 9/21/15 10:04 AM, Yan Qi wrote:
Hi Cheng,

Thanks a lot for your help! A quick question: do we need to regenerate the data to reflect this change?

Thanks,
Yan

On Sat, Sep 19, 2015 at 12:55 AM, Cheng Lian <[email protected]> wrote:

To be more specific, the Parquet schema converted from the Avro schema is:

    message avro.parquet.model.Profile {
      required group nest2 (LIST) {
        repeated group array {
          required group nest1 {
            required int32 mpi;
            required int32 ts;
          }
        }
      }
    }

The second-level "array" field should be viewed as the element type according to the following LIST backwards-compatibility rule:

    If the repeated field is a group with one field and is named either "array" or
    uses the LIST-annotated group's name with "_tuple" appended, then the repeated
    type is the element type and elements are required.

However, this rule is only handled in AvroSchemaConverter but not in AvroIndexedRecordConverter. Thus, "array" is interpreted as the element wrapper group, and the element type of nest2 is considered to be

    required group nest1 {
      required int32 mpi;
      required int32 ts;
    }

rather than

    required group Nest2 {
      required group nest1 {
        required int32 mpi;
        required int32 ts;
      }
    }

This PR should probably fix your issue (although I still have some questions about the testing code): https://github.com/apache/parquet-mr/pull/264

BTW, the recently released Spark SQL 1.5 should be able to read such Parquet files, but you need some extra work if you want to convert Spark SQL row objects to Avro records. This PR fixed the issue for Spark SQL: https://github.com/apache/spark/pull/8361

Cheng

On 9/19/15 12:40 AM, Cheng Lian wrote:

You are hitting PARQUET-364: https://issues.apache.org/jira/browse/PARQUET-364 Although the symptoms are different, they share the same root cause.
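The backwards-compatibility rule quoted above can be sketched as a tiny standalone predicate. This is a hypothetical model for illustration only (it covers just the quoted single-field case, and it is not parquet-mr's actual isElementType method, which also consults the requested Avro schema):

```java
// Sketch of the quoted LIST backwards-compatibility rule: a repeated group
// with exactly one field that is named "array", or that is named
// "<list>_tuple" for a LIST-annotated group <list>, is itself the element
// type; any other single-field repeated group wraps the element instead.
public class ListCompatRule {
    static boolean repeatedGroupIsElementType(String repeatedName,
                                              int fieldCount,
                                              String listGroupName) {
        return fieldCount == 1
                && (repeatedName.equals("array")
                    || repeatedName.equals(listGroupName + "_tuple"));
    }

    public static void main(String[] args) {
        // The nest2 case from this thread: "repeated group array" with the
        // single field "nest1" -> "array" itself is the element type.
        System.out.println(repeatedGroupIsElementType("array", 1, "nest2")); // true
        // A single-field repeated group with another name wraps the element.
        System.out.println(repeatedGroupIsElementType("list", 1, "nest2"));  // false
    }
}
```

The asymmetry described above is precisely that AvroSchemaConverter applies this rule while (before the fix) AvroIndexedRecordConverter did not.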
Cheng

On 9/18/15 6:13 PM, Yan Qi wrote:

Hi,

The data I am working on have the following schema:

    { "namespace": "avro.parquet.model",
      "type": "record",
      "name": "Profile",
      "fields": [
        {"name": "nest1", "type": [{"type": "array", "items": "Nest1"}, "null"]},
        {"name": "nest2", "type": [{"type": "array", "items": "Nest2"}, "null"]},
        {"name": "idd", "type": "long"}
      ]
    }

    { "namespace": "avro.parquet.model",
      "type": "record",
      "name": "Nest1",
      "fields": [
        {"name": "ts", "type": "int"},
        {"name": "mpi", "type": "int"},
        {"name": "api1", "type": ["int", "null"], "default": -1},
        {"name": "api2", "type": ["long", "null"], "default": -1},
        {"name": "api3", "type": ["int", "null"], "default": -1}
      ]
    }

    { "namespace": "avro.parquet.model",
      "type": "record",
      "name": "Nest2",
      "fields": [
        {"name": "nest1", "type": "Nest1"},
        {"name": "ts", "type": "int"}
      ]
    }

Basically, the table Profile contains two nested tables, nest1 and nest2, and Nest2 has a member of type Nest1. However, when I try to access the data with the following projection schema:

    { "type": "record",
      "name": "Profile",
      "namespace": "avro.parquet.model",
      "fields": [
        { "name": "nest2",
          "type": [
            { "type": "array",
              "items": {
                "type": "record",
                "name": "Nest2",
                "fields": [
                  { "name": "nest1",
                    "type": {
                      "type": "record",
                      "name": "Nest1",
                      "fields": [
                        {"name": "mpi", "type": "int"},
                        {"name": "ts", "type": "int"}
                      ]
                    }
                  }
                ]
              }
            }
          ]
        }
      ]
    }

I get the error message:

    org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch. Avro field 'mpi' not found.
        at org.apache.parquet.avro.AvroIndexedRecordConverter.getAvroField(AvroIndexedRecordConverter.java:133)
        ....

When checking the code, I found that the function org.apache.parquet.avro.AvroIndexedRecordConverter.AvroArrayConverter.isElementType(Type, Schema) returns false, even though the two parameters (i.e., type and schema) are compatible. I am wondering whether this is a bug, or whether I missed something here.

Thanks,
Yan
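The failure Yan describes can be reproduced in miniature with a name-lookup sketch (hypothetical helper names; the real check lives in AvroIndexedRecordConverter.getAvroField): if "array" is wrongly treated as an element wrapper, the element type becomes the inner group nest1, and its fields mpi and ts are looked up in the projected Avro record Nest2, which only has a nest1 field.

```java
import java.util.List;
import java.util.Set;

// Hypothetical model of the field matching done while assembling records:
// every Parquet field name of the (supposed) element type must resolve to a
// field of the corresponding Avro record, otherwise reading fails.
public class SchemaMismatchSketch {
    static void matchFields(List<String> parquetFieldNames, Set<String> avroFieldNames) {
        for (String name : parquetFieldNames) {
            if (!avroFieldNames.contains(name)) {
                throw new IllegalStateException("Avro field '" + name + "' not found");
            }
        }
    }

    public static void main(String[] args) {
        // Fields of the projected Avro element record Nest2.
        Set<String> avroNest2 = Set.of("nest1");

        // Correct interpretation: the repeated group "array" is itself the
        // element, so its single field {nest1} is matched against Nest2 -> OK.
        matchFields(List.of("nest1"), avroNest2);

        // Buggy interpretation: "array" is a wrapper, so the element becomes
        // "group nest1 { mpi; ts }" and {mpi, ts} are matched against Nest2.
        try {
            matchFields(List.of("mpi", "ts"), avroNest2);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // Avro field 'mpi' not found
        }
    }
}
```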
