Hi Cheng, I believe that the patch (https://github.com/apache/parquet-mr/pull/264) should solve my problem.
Thanks for your help, :)

Cheers,
Yan

On Mon, Sep 21, 2015 at 1:54 PM, Cheng Lian <[email protected]> wrote:

> Could you please paste the schemas as a GitHub Gist? I couldn't get your
> idea since the message is pretty much garbled...
>
> And could you please also share the schema of the physical Parquet file
> you are trying to read? You may get it with the parquet-schema command
> line tool.
>
> Cheng
>
> On 9/21/15 1:36 PM, Yan Qi wrote:
>
>> Hi Cheng,
>>
>> The two projection schemas in the image are identical except for one
>> field: the left-hand schema has an extra "ts" field directly under
>> Nest2 (marked below), while the right-hand (modified) schema omits it:
>>
>> {
>>   "type": "record",
>>   "name": "Profile",
>>   "namespace": "avro.parquet.model",
>>   "fields": [
>>     {
>>       "name": "nest2",
>>       "type": [
>>         {
>>           "type": "array",
>>           "items": {
>>             "type": "record",
>>             "name": "Nest2",
>>             "fields": [
>>               { "name": "ts", "type": "int" },   <=== left-hand schema only
>>               {
>>                 "name": "nest1",
>>                 "type": {
>>                   "type": "record",
>>                   "name": "Nest1",
>>                   "fields": [
>>                     { "name": "mpi", "type": "int" },
>>                     { "name": "ts", "type": "int" }
>>                   ]
>>                 }
>>               }
>>             ]
>>           }
>>         }
>>       ]
>>     }
>>   ]
>> }
>>
>> I just realized that in my previous experiment, I made a mistake. Then I
>> reran the reading job with the modified schema on the right-hand side
>> (see above), and it succeeded.
>>
>> I am wondering how this can work.
>>
>> Thanks,
>> Yan
>>
>> On Mon, Sep 21, 2015 at 1:02 PM, Cheng Lian <[email protected]> wrote:
>>
>>> Also, could you please provide the full exception stack trace?
>>>
>>> Cheng
>>>
>>> On 9/21/15 1:01 PM, Cheng Lian wrote:
>>>
>>>> Hm...
>>>> unfortunately the inline image fails to show up :(
>>>>
>>>> On 9/21/15 12:06 PM, Yan Qi wrote:
>>>>
>>>>> Hi Cheng,
>>>>>
>>>>> I am now using parquet-mr 1.8.0, though I doubt it could solve the
>>>>> issue. I investigated further by trying the following query (see the
>>>>> image below):
>>>>>
>>>>> Inline image 1
>>>>>
>>>>> I added one more attribute to the projection schema, directly under
>>>>> Nest2. Surprisingly, the org.apache.parquet.io.InvalidRecordException
>>>>> is gone.
>>>>>
>>>>> However, there is another exception:
>>>>>
>>>>> org.apache.parquet.io.ParquetDecodingException: Can not read value at 0
>>>>> in block -1 in file file:/path/to/part-00011-avro-parquet-100
>>>>>   at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
>>>>>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:125)
>>>>>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:129)
>>>>>   ...
>>>>> Caused by: org.apache.parquet.io.ParquetDecodingException: The
>>>>> requested schema is not compatible with the file schema. incompatible
>>>>> types: required group nest2 (LIST) {
>>>>>   ...
>>>>>
>>>>> I am a little bit confused by the result. Could you help me understand
>>>>> this a bit?
>>>>>
>>>>> Thanks,
>>>>> Yan
>>>>>
>>>>> On Mon, Sep 21, 2015 at 10:53 AM, Cheng Lian <[email protected]> wrote:
>>>>>
>>>>> Sorry, missed the link to my 2nd test case:
>>>>>
>>>>> https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/test/scala/com/databricks/parquet/schema/ListCompatibilitySuite.scala#L137-L155
>>>>>
>>>>> Cheng
>>>>>
>>>>> On 9/21/15 10:52 AM, Cheng Lian wrote:
>>>>>
>>>>> I only checked the schema part while writing my last reply.
>>>>> However, after some more testing, I found that parquet-avro 1.7.0
>>>>> actually can decode your data correctly, and "isElementType" does
>>>>> return true because of this "if" branch:
>>>>>
>>>>> https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-avro/src/main/java/org/apache/parquet/avro/AvroIndexedRecordConverter.java#L518-L522
>>>>>
>>>>> May I know the version of parquet-avro/parquet-mr you were working
>>>>> with?
>>>>>
>>>>> Here are my tests (in Scala, with parquet-mr 1.7.0):
>>>>>
>>>>> - Case 1: Reading/writing a raw Parquet file containing an Avro array
>>>>>   of single-element records.
>>>>>   https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/test/scala/com/databricks/parquet/schema/ListCompatibilitySuite.scala#L103-L135
>>>>> - Case 2: The Profile example you provided. Avro definitions can be
>>>>>   found here:
>>>>>   https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/main/avro/parquet-avro-compat.avdl#L24-L35
>>>>>
>>>>> Is it possible that the Profile example schema you provided doesn't
>>>>> fully demonstrate the nature of your problematic data?
>>>>>
>>>>> Cheng
>>>>>
>>>>> On 9/21/15 10:04 AM, Yan Qi wrote:
>>>>>
>>>>> Hi Cheng,
>>>>>
>>>>> Thanks a lot for your help!
>>>>>
>>>>> A quick question: do we need to regenerate the data to reflect this
>>>>> change?
>>>>> Thanks,
>>>>> Yan
>>>>>
>>>>> On Sat, Sep 19, 2015 at 12:55 AM, Cheng Lian <[email protected]> wrote:
>>>>>
>>>>> To be more specific, the Parquet schema converted from the Avro
>>>>> schema is:
>>>>>
>>>>> message avro.parquet.model.Profile {
>>>>>   required group nest2 (LIST) {
>>>>>     repeated group array {
>>>>>       required group nest1 {
>>>>>         required int32 mpi;
>>>>>         required int32 ts;
>>>>>       }
>>>>>     }
>>>>>   }
>>>>> }
>>>>>
>>>>> The second-level "array" field should be viewed as the element type
>>>>> according to the following LIST backwards-compatibility rule:
>>>>>
>>>>>   If the repeated field is a group with one field and is named either
>>>>>   "array" or uses the LIST-annotated group's name with "_tuple"
>>>>>   appended, then the repeated type is the element type and elements
>>>>>   are required.
>>>>>
>>>>> However, this rule is only handled in AvroSchemaConverter but not in
>>>>> AvroIndexedRecordConverter. Thus, "array" is interpreted as the
>>>>> element wrapper group, and the element type of nest2 is considered
>>>>> to be
>>>>>
>>>>> required group nest1 {
>>>>>   required int32 mpi;
>>>>>   required int32 ts;
>>>>> }
>>>>>
>>>>> rather than
>>>>>
>>>>> required group Nest2 {
>>>>>   required group nest1 {
>>>>>     required int32 mpi;
>>>>>     required int32 ts;
>>>>>   }
>>>>> }
>>>>>
>>>>> This PR should probably fix your issue (although I still have some
>>>>> questions about the testing code):
>>>>> https://github.com/apache/parquet-mr/pull/264
>>>>>
>>>>> BTW, the recently released Spark SQL 1.5 should be able to read such
>>>>> Parquet files, but you need some extra work if you want to convert
>>>>> Spark SQL row objects to Avro records.
>>>>> This PR fixed this issue for Spark SQL:
>>>>> https://github.com/apache/spark/pull/8361
>>>>>
>>>>> Cheng
>>>>>
>>>>> On 9/19/15 12:40 AM, Cheng Lian wrote:
>>>>>
>>>>> You are hitting PARQUET-364:
>>>>> https://issues.apache.org/jira/browse/PARQUET-364
>>>>>
>>>>> Although the symptoms are different, they share the same root cause.
>>>>>
>>>>> Cheng
>>>>>
>>>>> On 9/18/15 6:13 PM, Yan Qi wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> The data I am working on has the following schema:
>>>>>
>>>>> {
>>>>>   "namespace": "avro.parquet.model",
>>>>>   "type": "record",
>>>>>   "name": "Profile",
>>>>>   "fields": [
>>>>>     {"name": "nest1", "type": [{"type": "array", "items": "Nest1"}, "null"]},
>>>>>     {"name": "nest2", "type": [{"type": "array", "items": "Nest2"}, "null"]},
>>>>>     {"name": "idd", "type": "long"}
>>>>>   ]
>>>>> }
>>>>>
>>>>> {
>>>>>   "namespace": "avro.parquet.model",
>>>>>   "type": "record",
>>>>>   "name": "Nest1",
>>>>>   "fields": [
>>>>>     {"name": "ts", "type": "int"},
>>>>>     {"name": "mpi", "type": "int"},
>>>>>     {"name": "api1", "type": ["int", "null"], "default": -1},
>>>>>     {"name": "api2", "type": ["long", "null"], "default": -1},
>>>>>     {"name": "api3", "type": ["int", "null"], "default": -1}
>>>>>   ]
>>>>> }
>>>>>
>>>>> {
>>>>>   "namespace": "avro.parquet.model",
>>>>>   "type": "record",
>>>>>   "name": "Nest2",
>>>>>   "fields": [
>>>>>     {"name": "nest1", "type": "Nest1"},
>>>>>     {"name": "ts", "type": "int"}
>>>>>   ]
>>>>> }
>>>>>
>>>>> Basically there are two nested tables, nest1 and nest2, in the
>>>>> Profile table. Nest2 has a member of type Nest1.
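[Editor's note: to make the nesting in the three Avro records above easier to see, here is a plain-Java mirror of them. These are hypothetical hand-written classes for illustration only, not the Avro-generated ones; the `["type", "null"]` unions are modeled as nullable boxed fields.]

```java
import java.util.List;

// Plain-Java mirror of the Avro records above, for illustration only.
class Nest1 {
    int ts;
    int mpi;
    Integer api1; // ["int", "null"]  -> optional
    Long api2;    // ["long", "null"] -> optional
    Integer api3; // ["int", "null"]  -> optional
}

class Nest2 {
    Nest1 nest1;  // Nest2 embeds a full Nest1 record
    int ts;
}

class Profile {
    List<Nest1> nest1; // optional array of Nest1
    List<Nest2> nest2; // optional array of Nest2 -- the nested "table" being projected
    long idd;
}
```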
>>>>> However, when I try to get access to the data with the following
>>>>> projection schema:
>>>>>
>>>>> {
>>>>>   "type": "record",
>>>>>   "name": "Profile",
>>>>>   "namespace": "avro.parquet.model",
>>>>>   "fields": [
>>>>>     {
>>>>>       "name": "nest2",
>>>>>       "type": [
>>>>>         {
>>>>>           "type": "array",
>>>>>           "items": {
>>>>>             "type": "record",
>>>>>             "name": "Nest2",
>>>>>             "fields": [
>>>>>               {
>>>>>                 "name": "nest1",
>>>>>                 "type": {
>>>>>                   "type": "record",
>>>>>                   "name": "Nest1",
>>>>>                   "fields": [
>>>>>                     { "name": "mpi", "type": "int" },
>>>>>                     { "name": "ts", "type": "int" }
>>>>>                   ]
>>>>>                 }
>>>>>               }
>>>>>             ]
>>>>>           }
>>>>>         }
>>>>>       ]
>>>>>     }
>>>>>   ]
>>>>> }
>>>>>
>>>>> I get the error message:
>>>>>
>>>>> org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema
>>>>> mismatch. Avro field 'mpi' not found.
>>>>>   at org.apache.parquet.avro.AvroIndexedRecordConverter.getAvroField(AvroIndexedRecordConverter.java:133)
>>>>>   ...
>>>>>
>>>>> When checking the code, I found that the function
>>>>> org.apache.parquet.avro.AvroIndexedRecordConverter.AvroArrayConverter.isElementType(Type, Schema)
>>>>> returns false, though these two parameters (i.e., type and schema)
>>>>> are compatible. I am wondering if it is a bug, or I missed something
>>>>> here.
>>>>>
>>>>> Thanks,
>>>>> Yan
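[Editor's note: the isElementType question at the end of the thread comes down to the LIST backwards-compatibility rule Cheng quotes earlier: a repeated group with exactly one field, named "array" (or "<listName>_tuple"), is itself the element type rather than a wrapper. Below is a minimal, self-contained sketch of that check; the class, method name, and signature are illustrative, not parquet-mr's actual API.]

```java
public class ListCompatRule {
    // Sketch of the LIST backwards-compatibility rule quoted in the thread:
    // the repeated group is itself the element type when it has exactly one
    // field and is named "array" or "<listName>_tuple". The real logic lives
    // in parquet-mr (handled by AvroSchemaConverter, but not, at the time of
    // this thread, by AvroIndexedRecordConverter).
    static boolean repeatedIsElementType(int fieldCount, String repeatedName, String listName) {
        return fieldCount == 1
            && (repeatedName.equals("array") || repeatedName.equals(listName + "_tuple"));
    }

    public static void main(String[] args) {
        // nest2's repeated group "array" has one field (nest1), so under the
        // rule it is the Nest2 element itself, not an element wrapper:
        System.out.println(repeatedIsElementType(1, "array", "nest2")); // true
        // A wrapper group with a different name does not match the rule:
        System.out.println(repeatedIsElementType(1, "list", "nest2"));  // false
    }
}
```

Because the writer schema's repeated group is named "array" and wraps a single field, the rule says the reader should treat it as the Nest2 element itself; the record converter failing to apply this rule is the root cause Cheng points to (PARQUET-364, fixed by parquet-mr PR #264).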
