Hi Cheng,

I am now using parquet-mr 1.8.0, though I doubt it will solve the issue. I
investigated further by trying the following query (see the image
below):

[image: Inline image 1]

I added one more attribute to the projection schema, directly under Nest2.
Surprisingly, the org.apache.parquet.io.InvalidRecordException
is gone.

However, there is another exception:

org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in
block -1 in file file:/path/to/part-00011-avro-parquet-100
at
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:125)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:129)
...
Caused by: org.apache.parquet.io.ParquetDecodingException: The requested
schema is not compatible with the file schema. incompatible types: required
group nest2 (LIST) {
...

I am a little confused by the result. Could you help me understand this a
bit?

Thanks,
Yan

On Mon, Sep 21, 2015 at 10:53 AM, Cheng Lian <[email protected]> wrote:

> Sorry, I missed the link to my 2nd test case:
> https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/test/scala/com/databricks/parquet/schema/ListCompatibilitySuite.scala#L137-L155
>
> Cheng
>
> On 9/21/15 10:52 AM, Cheng Lian wrote:
>
>> I only checked the schema part while writing my last reply. However,
>> after some more testing, I found that actually parquet-avro 1.7.0 can
>> correctly decode your data, and "isElementType" does return true because of
>> this "if" branch <
>> https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-avro/src/main/java/org/apache/parquet/avro/AvroIndexedRecordConverter.java#L518-L522>.
>> May I know the version of parquet-avro/parquet-mr you were working with?
>>
>> Here are my tests (in Scala, with parquet-mr 1.7.0):
>>
>> - Case 1 <
>> https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/test/scala/com/databricks/parquet/schema/ListCompatibilitySuite.scala#L103-L135>:
>> Reading/writing a raw Parquet file containing an Avro array of
>> single-element records.
>> - Case 2: The Profile example you provided. Avro definitions can be found
>> here <
>> https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/main/avro/parquet-avro-compat.avdl#L24-L35
>> >.
>>
>>
>> Is it possible that the Profile example schema you provided doesn't fully
>> demonstrate the nature of your problematic data?
>>
>> Cheng
>>
>> On 9/21/15 10:04 AM, Yan Qi wrote:
>>
>>> Hi Cheng,
>>>
>>> Thanks a lot for your help!
>>>
>>> A quick question: do we need to regenerate the data to reflect this
>>> change?
>>>
>>> Thanks,
>>> Yan
>>>
>>> On Sat, Sep 19, 2015 at 12:55 AM, Cheng Lian <[email protected]>
>>> wrote:
>>>
>>>> To be more specific, the Parquet schema converted from the Avro schema
>>>> is:
>>>>
>>>> message avro.parquet.model.Profile {
>>>>    required group nest2 (LIST) {
>>>>      repeated group array {
>>>>        required group nest1 {
>>>>          required int32 mpi;
>>>>          required int32 ts;
>>>>        }
>>>>      }
>>>>    }
>>>> }
>>>>
>>>> The second level "array" field should be viewed as element type
>>>> according
>>>> to the following LIST backwards-compatibility rule:
>>>>
>>>>     If the repeated field is a group with one field and is named either
>>>>     "array" or uses the LIST-annotated group's name with "_tuple"
>>>>     appended, then the repeated type is the element type and elements
>>>>     are required.
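The quoted rule can be sketched as a small predicate (a Python sketch with illustrative names; this is not parquet-mr code):

```python
# Sketch of the LIST backwards-compatibility rule quoted above (illustrative,
# not parquet-mr code): a repeated group with exactly one field is itself the
# element type when it is named "array" or "<listName>_tuple".

def repeated_group_is_element_type(repeated_name, field_count, list_name):
    """Return True when the repeated group should be read as the element type
    directly (and elements are required), not as an element wrapper group."""
    if field_count != 1:
        return False
    return repeated_name == "array" or repeated_name == list_name + "_tuple"

# For the schema above: the repeated group "array" has one field ("nest1"),
# so it is the element type of the nest2 list.
print(repeated_group_is_element_type("array", 1, "nest2"))  # True
```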
>>>>
>>>> However, this rule is only handled in AvroSchemaConverter but not
>>>> AvroIndexedRecordConverter. Thus, "array" is interpreted as the element
>>>> wrapper group, and the element type of nest2 is considered to be
>>>>
>>>> required group nest1 {
>>>>    required int32 mpi;
>>>>    required int32 ts;
>>>> }
>>>>
>>>> rather than
>>>>
>>>> required group Nest2 {
>>>>    required group nest1 {
>>>>      required int32 mpi;
>>>>      required int32 ts;
>>>>    }
>>>> }
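The mismatch between the two interpretations can be illustrated with a small sketch (Python, hypothetical helper names; not the actual parquet-avro code): the reader walks the Parquet group's fields and looks each one up in the Avro record schema, and the wrapper interpretation makes that lookup fail.

```python
# Illustrative sketch of why the wrapper interpretation fails: the converter
# walks the Parquet fields and looks each one up in the Avro record schema,
# raising when a field is missing -- the analogue of getAvroField throwing
# InvalidRecordException.

def get_avro_field(avro_record_fields, parquet_field):
    if parquet_field not in avro_record_fields:
        raise ValueError("Avro field '%s' not found" % parquet_field)
    return parquet_field

def match_group(parquet_group_fields, avro_record_fields):
    return [get_avro_field(avro_record_fields, f) for f in parquet_group_fields]

nest2_avro_fields = ["nest1", "ts"]  # fields of the Avro record Nest2

# Correct interpretation: the repeated group "array" IS the element type,
# so its fields ("nest1") are matched against Nest2 -- this succeeds.
match_group(["nest1"], nest2_avro_fields)

# Buggy interpretation: "array" is a wrapper, so the inner group "nest1"
# (fields "mpi", "ts") is matched against Nest2 -- "mpi" is not found.
try:
    match_group(["mpi", "ts"], nest2_avro_fields)
except ValueError as e:
    print(e)  # Avro field 'mpi' not found
```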
>>>>
>>>> This PR should probably fix your issue (although I still have some
>>>> questions about the testing code):
>>>> https://github.com/apache/parquet-mr/pull/264
>>>>
>>>> BTW, the recently released Spark SQL 1.5 should be able to read such Parquet
>>>> files, but you need some extra work if you want to convert Spark SQL row
>>>> objects to Avro records. This PR fixed this issue for Spark SQL:
>>>> https://github.com/apache/spark/pull/8361
>>>>
>>>> Cheng
>>>>
>>>>
>>>> On 9/19/15 12:40 AM, Cheng Lian wrote:
>>>>
>>>>> You are hitting PARQUET-364:
>>>>> https://issues.apache.org/jira/browse/PARQUET-364
>>>>>
>>>>> Although the symptoms are different, they share the same root cause.
>>>>>
>>>>> Cheng
>>>>>
>>>>> On 9/18/15 6:13 PM, Yan Qi wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> The data I am working on have the following schema:
>>>>>>
>>>>>> {
>>>>>> "namespace": "avro.parquet.model",
>>>>>> "type": "record",
>>>>>> "name": "Profile",
>>>>>> "fields": [
>>>>>> {"name": "nest1", "type": [{ "type": "array", "items": "Nest1"},
>>>>>> "null"]},
>>>>>> {"name": "nest2", "type": [{ "type": "array", "items": "Nest2"},
>>>>>> "null"]},
>>>>>> {"name": "idd", "type": "long"}
>>>>>> ]
>>>>>> }
>>>>>>
>>>>>> {
>>>>>> "namespace": "avro.parquet.model",
>>>>>> "type": "record",
>>>>>> "name": "Nest1",
>>>>>> "fields": [
>>>>>> {"name": "ts", "type": "int"},
>>>>>> {"name": "mpi", "type": "int"},
>>>>>> {"name": "api1", "type": ["int", "null"], "default": -1},
>>>>>> {"name": "api2", "type": ["long", "null"], "default": -1},
>>>>>> {"name": "api3", "type": ["int", "null"], "default": -1}
>>>>>> ]
>>>>>> }
>>>>>>
>>>>>> {
>>>>>> "namespace": "avro.parquet.model",
>>>>>> "type": "record",
>>>>>> "name": "Nest2",
>>>>>> "fields": [
>>>>>> {"name": "nest1", "type": "Nest1"},
>>>>>> {"name": "ts", "type": "int"} ]
>>>>>> }
>>>>>>
>>>>>> Basically, the table Profile contains two nested arrays, nest1 and
>>>>>> nest2, and Nest2 has a member of type Nest1. However, when I try to
>>>>>> access the data with the following projection schema:
>>>>>>
>>>>>> {
>>>>>>     "type": "record",
>>>>>>     "name": "Profile",
>>>>>>     "namespace": "avro.parquet.model",
>>>>>>     "fields": [
>>>>>>       {
>>>>>>         "name": "nest2",
>>>>>>         "type": [
>>>>>>           {
>>>>>>             "type": "array",
>>>>>>             "items": {
>>>>>>               "type": "record",
>>>>>>               "name": "Nest2",
>>>>>>               "fields": [
>>>>>>                 {
>>>>>>                   "name": "nest1",
>>>>>>                   "type": {
>>>>>>                     "type": "record",
>>>>>>                     "name": "Nest1",
>>>>>>                     "fields": [
>>>>>>                       {
>>>>>>                         "name": "mpi",
>>>>>>                         "type": "int"
>>>>>>                       },
>>>>>>                       {
>>>>>>                         "name": "ts",
>>>>>>                         "type": "int"
>>>>>>                       }
>>>>>>                     ]
>>>>>>                   }
>>>>>>                 }
>>>>>>               ]
>>>>>>             }
>>>>>>           }
>>>>>>         ]
>>>>>>       }
>>>>>>     ]
>>>>>> }
>>>>>>
>>>>>> I get the error message:
>>>>>>
>>>>>> org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema
>>>>>> mismatch.
>>>>>> Avro field 'mpi' not found.
>>>>>> at
>>>>>>
>>>>>> org.apache.parquet.avro.AvroIndexedRecordConverter.getAvroField(AvroIndexedRecordConverter.java:133)
>>>>>>
>>>>>> ....
>>>>>>
>>>>>>
>>>>>> When checking the code, I found that the function
>>>>>> org.apache.parquet.avro.AvroIndexedRecordConverter.AvroArrayConverter.isElementType(Type,
>>>>>> Schema) returns false, though the two parameters (i.e., the type and
>>>>>> the schema) are compatible. I am wondering if this is a bug, or if I
>>>>>> missed something here.
>>>>>>
>>>>>> Thanks,
>>>>>> Yan
>>>>>>
>>>>>>
>>>>>>
>>
>
