Re: Does Avro-Parquet support deep nesting?

Yan Qi Mon, 21 Sep 2015 10:06:06 -0700

Hi Cheng,

Thanks a lot for your help!


A quick question: do we need to regenerate the data to reflect this change?

Thanks,
Yan

On Sat, Sep 19, 2015 at 12:55 AM, Cheng Lian wrote:

> To be more specific, the Parquet schema converted from the Avro schema is:
>
> message avro.parquet.model.Profile {
>   required group nest2 (LIST) {
>     repeated group array {
>       required group nest1 {
>         required int32 mpi;
>         required int32 ts;
>       }
>     }
>   }
> }
>
> The second-level "array" field should be viewed as the element type
> according to the following LIST backwards-compatibility rule:
>
>    If the repeated field is a group with one field and is named either
>    "array" or uses the LIST-annotated group's name with "_tuple"
>    appended then the repeated type is the element type and elements are
>    required.
>
> However, this rule is handled only in AvroSchemaConverter, not in
> AvroIndexedRecordConverter. Thus, "array" is interpreted as the element
> wrapper group, and the element type of nest2 is considered to be
>
> required group nest1 {
>   required int32 mpi;
>   required int32 ts;
> }
>
> rather than
>
> required group Nest2 {
>   required group nest1 {
>     required int32 mpi;
>     required int32 ts;
>   }
> }
>
> This PR should probably fix your issue (although I still have some
> questions about the testing code)
> https://github.com/apache/parquet-mr/pull/264
>
> BTW, the recently released Spark SQL 1.5 should be able to read such
> Parquet files, but you need some extra work if you want to convert Spark
> SQL row objects to Avro records. This PR fixed this issue for Spark SQL:
> https://github.com/apache/spark/pull/8361
>
> Cheng
>
>
> On 9/19/15 12:40 AM, Cheng Lian wrote:
>
>> You are hitting PARQUET-364
>> https://issues.apache.org/jira/browse/PARQUET-364
>>
>> Although the symptoms are different, they share the same root cause.
>>
>> Cheng
>>
>> On 9/18/15 6:13 PM, Yan Qi wrote:
>>
>>> Hi,
>>>
>>> The data I am working on have the following schema:
>>>
>>> {
>>> "namespace": "avro.parquet.model",
>>> "type": "record",
>>> "name": "Profile",
>>> "fields": [
>>> {"name": "nest1", "type": [{ "type": "array", "items": "Nest1"},
>>> "null"]},
>>> {"name": "nest2", "type": [{ "type": "array", "items": "Nest2"},
>>> "null"]},
>>> {"name": "idd", "type": "long"} ]
>>> }
>>>
>>> {
>>> "namespace": "avro.parquet.model",
>>> "type": "record",
>>> "name": "Nest1",
>>> "fields": [
>>> {"name": "ts", "type": "int"},
>>> {"name": "mpi", "type": "int"},
>>> {"name": "api1", "type": ["int", "null"], "default": -1},
>>> {"name": "api2", "type": ["long", "null"], "default": -1},
>>> {"name": "api3", "type": ["int", "null"], "default": -1} ]
>>> }
>>>
>>> {
>>> "namespace": "avro.parquet.model",
>>> "type": "record",
>>> "name": "Nest2",
>>> "fields": [
>>> {"name": "nest1", "type": "Nest1"},
>>> {"name": "ts", "type": "int"} ]
>>> }
>>>
>>> Basically, the Profile table contains two nested tables, nest1 and
>>> nest2, and Nest2 has a member of type Nest1. I am trying to access the
>>> data with the following projection schema:
>>>
>>> {
>>>    "type": "record",
>>>    "name": "Profile",
>>>    "namespace": "avro.parquet.model",
>>>    "fields": [
>>>      {
>>>        "name": "nest2",
>>>        "type": [
>>>          {
>>>            "type": "array",
>>>            "items": {
>>>              "type": "record",
>>>              "name": "Nest2",
>>>              "fields": [
>>>                {
>>>                  "name": "nest1",
>>>                  "type": {
>>>                    "type": "record",
>>>                    "name": "Nest1",
>>>                    "fields": [
>>>                      {
>>>                        "name": "mpi",
>>>                        "type": "int"
>>>                      },
>>>                      {
>>>                        "name": "ts",
>>>                        "type": "int"
>>>                      }
>>>                    ]
>>>                  }
>>>                }
>>>              ]
>>>            }
>>>          }
>>>        ]
>>>      }
>>>    ]
>>> }
>>>
>>> But I get the following error message:
>>>
>>> org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch. Avro field 'mpi' not found.
>>>     at org.apache.parquet.avro.AvroIndexedRecordConverter.getAvroField(AvroIndexedRecordConverter.java:133)
>>>
>>> ....
>>>
>>>
>>> When checking the code, I found that the function
>>> org.apache.parquet.avro.AvroIndexedRecordConverter.AvroArrayConverter.isElementType(Type, Schema)
>>> returns false, even though its two parameters (i.e., type and schema)
>>> are compatible. I am wondering whether this is a bug or whether I
>>> missed something here.
>>>
>>> Thanks,
>>> Yan
>>>
>>>
>>
>
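
For illustration, the LIST backwards-compatibility rule quoted in this thread
can be sketched in a few lines of Python. This is only a minimal model under
stated assumptions, not the parquet-mr implementation: the Node class and the
functions repeated_is_element and element_type are hypothetical names, and
only the single rule quoted above is handled (the Parquet format defines
further rules that are omitted here).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a Parquet type; a leaf has no fields."""
    name: str
    fields: List["Node"] = field(default_factory=list)


def repeated_is_element(list_name: str, repeated: Node) -> bool:
    """The quoted rule: a repeated group with exactly one field, named
    'array' or '<list name>_tuple', is itself the element type."""
    return (
        len(repeated.fields) == 1
        and repeated.name in ("array", list_name + "_tuple")
    )


def element_type(list_name: str, repeated: Node, apply_rule: bool) -> Node:
    """Resolve the element type of a LIST-annotated group.

    apply_rule=False mimics a reader that ignores the rule and treats a
    single-field repeated group as a wrapper around the element."""
    if apply_rule and repeated_is_element(list_name, repeated):
        return repeated
    if len(repeated.fields) == 1:
        return repeated.fields[0]  # unwrap the assumed wrapper group
    return repeated


# The Parquet schema from the thread: nest2 (LIST) > repeated group array.
nest1 = Node("nest1", [Node("mpi"), Node("ts")])
array = Node("array", [nest1])

with_rule = element_type("nest2", array, apply_rule=True)
without_rule = element_type("nest2", array, apply_rule=False)

print([f.name for f in with_rule.fields])     # ['nest1']
print([f.name for f in without_rule.fields])  # ['mpi', 'ts']
```

With the rule applied, the element exposes the single field nest1, which
lines up with the Avro record Nest2 in the projection schema. Without it, the
element exposes mpi and ts directly, so looking those names up in Nest2 fails,
which is consistent with the "Avro field 'mpi' not found" error reported
above.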
