Hi Cheng,
I am now using parquet-mr 1.8.0, though I doubt it will solve the
issue. I investigated further by trying the following query:
[inline image of the query, not included]
I added one more attribute to the projection schema, directly under
Nest2. Surprisingly, the org.apache.parquet.io.InvalidRecordException
is gone. However, there is now a different exception:
org.apache.parquet.io.ParquetDecodingException: Can not read value at
0 in block -1 in file file:/path/to/part-00011-avro-parquet-100
at
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:125)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:129)
...
Caused by: org.apache.parquet.io.ParquetDecodingException: The
requested schema is not compatible with the file schema. incompatible
types: required group nest2 (LIST) {
...
I am a little bit confused by this result. Could you help me
understand it?
Thanks,
Yan
On Mon, Sep 21, 2015 at 10:53 AM, Cheng Lian <[email protected]> wrote:
Sorry, I missed the link to my second test case:
https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/test/scala/com/databricks/parquet/schema/ListCompatibilitySuite.scala#L137-L155
Cheng
On 9/21/15 10:52 AM, Cheng Lian wrote:
I only checked the schema part while writing my last reply.
However, after some more testing, I found that actually
parquet-avro 1.7.0 can correctly decode your data, and
"isElementType" does return true because of this "if" branch
<https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-avro/src/main/java/org/apache/parquet/avro/AvroIndexedRecordConverter.java#L518-L522>.
May I know the version of parquet-avro/parquet-mr you were
working with?
Here are my tests (in Scala, with parquet-mr 1.7.0):
- Case 1
<https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/test/scala/com/databricks/parquet/schema/ListCompatibilitySuite.scala#L103-L135>:
Reading/writing a raw Parquet file containing Avro array of
single-element records.
- Case 2: The Profile example you provided. Avro definitions
can be found here
<https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/main/avro/parquet-avro-compat.avdl#L24-L35>.
Is it possible that the Profile example schema you provided
doesn't fully demonstrate the nature of your problematic data?
Cheng
On 9/21/15 10:04 AM, Yan Qi wrote:
Hi Cheng,
Thanks a lot for your help!
A quick question: do we need to regenerate the data to
reflect this change?
Thanks,
Yan
On Sat, Sep 19, 2015 at 12:55 AM, Cheng Lian <[email protected]> wrote:
To be more specific, the Parquet schema converted from
the Avro schema is:
message avro.parquet.model.Profile {
required group nest2 (LIST) {
repeated group array {
required group nest1 {
required int32 mpi;
required int32 ts;
}
}
}
}
The second-level "array" field should be viewed as the element type
according to the following LIST backwards-compatibility rule:

    If the repeated field is a group with one field and is named
    either "array" or uses the LIST-annotated group's name with
    "_tuple" appended, then the repeated type is the element type
    and elements are required.
However, this rule is only handled in AvroSchemaConverter, not in
AvroIndexedRecordConverter. Thus, "array" is interpreted as an
element wrapper group, and the element type of nest2 is considered
to be
required group nest1 {
required int32 mpi;
required int32 ts;
}
rather than
required group Nest2 {
required group nest1 {
required int32 mpi;
required int32 ts;
}
}
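The naming part of that rule can be sketched as a small predicate. This is a simplified illustration, not the actual parquet-mr code, which inspects real Parquet Type objects rather than just names and field counts:

```java
// Simplified sketch of the LIST backwards-compatibility naming rule
// quoted above; the real check in parquet-mr also inspects the actual
// Parquet Type objects, not just names and field counts.
public class ListCompatSketch {
    // Returns true when the repeated group itself is the element type,
    // i.e. it should NOT be treated as a synthetic element wrapper.
    static boolean repeatedGroupIsElementType(String repeatedGroupName,
                                              int repeatedFieldCount,
                                              String listGroupName) {
        // A repeated group with more than one field is always the element type.
        if (repeatedFieldCount != 1) {
            return true;
        }
        // A one-field group named "array", or named after the LIST group
        // with "_tuple" appended, is also the element type under the
        // backwards-compatibility rule.
        return repeatedGroupName.equals("array")
            || repeatedGroupName.equals(listGroupName + "_tuple");
    }

    public static void main(String[] args) {
        // In the schema above: repeated group "array" with one field
        // ("nest1") inside the LIST group "nest2" -> element type.
        System.out.println(repeatedGroupIsElementType("array", 1, "nest2"));
        // prints "true"
    }
}
```

Under this rule the one-field group "array" in your file is the element type itself, which is why treating it as a wrapper produces the nest1-only element type shown above.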
This PR should probably fix your issue (although I still have some
questions about the testing code):
https://github.com/apache/parquet-mr/pull/264
BTW, the recently released Spark SQL 1.5 should be able to read such
Parquet files, but you need some extra work if you want to convert
Spark SQL row objects to Avro records. This PR fixed the issue for
Spark SQL:
https://github.com/apache/spark/pull/8361
Cheng
On 9/19/15 12:40 AM, Cheng Lian wrote:
You are hitting PARQUET-364
https://issues.apache.org/jira/browse/PARQUET-364
Although the symptoms are different, they share
the same root cause.
Cheng
On 9/18/15 6:13 PM, Yan Qi wrote:
Hi,
The data I am working with has the following schemas:
{
  "namespace": "avro.parquet.model",
  "type": "record",
  "name": "Profile",
  "fields": [
    {"name": "nest1", "type": [{"type": "array", "items": "Nest1"}, "null"]},
    {"name": "nest2", "type": [{"type": "array", "items": "Nest2"}, "null"]},
    {"name": "idd", "type": "long"}
  ]
}
{
  "namespace": "avro.parquet.model",
  "type": "record",
  "name": "Nest1",
  "fields": [
    {"name": "ts", "type": "int"},
    {"name": "mpi", "type": "int"},
    {"name": "api1", "type": ["int", "null"], "default": -1},
    {"name": "api2", "type": ["long", "null"], "default": -1},
    {"name": "api3", "type": ["int", "null"], "default": -1}
  ]
}
{
  "namespace": "avro.parquet.model",
  "type": "record",
  "name": "Nest2",
  "fields": [
    {"name": "nest1", "type": "Nest1"},
    {"name": "ts", "type": "int"}
  ]
}
Basically, there are two nested tables, nest1 and nest2, in the
table Profile, and Nest2 has a member of type Nest1. However, when I
try to access the data with the following projection schema:
{
"type": "record",
"name": "Profile",
"namespace": "avro.parquet.model",
"fields": [
{
"name": "nest2",
"type": [
{
"type": "array",
"items": {
"type": "record",
"name": "Nest2",
"fields": [
{
"name": "nest1",
"type": {
"type": "record",
"name": "Nest1",
"fields": [
{
"name": "mpi",
"type": "int"
},
{
"name": "ts",
"type": "int"
}
]
}
}
]
}
}
]
}
]
}
I get the following error:
org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema
mismatch. Avro field 'mpi' not found.
at
org.apache.parquet.avro.AvroIndexedRecordConverter.getAvroField(AvroIndexedRecordConverter.java:133)
....
When checking the code, I found that the method
org.apache.parquet.avro.AvroIndexedRecordConverter.AvroArrayConverter.isElementType(Type,
Schema) returns false, even though the two parameters (i.e., the
type and the schema) are compatible. I am wondering whether this is
a bug or whether I am missing something here.
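For context, I apply the projection roughly like this. This is a sketch assuming the parquet-avro 1.7/1.8 builder API; the file path and the way `projectionSchema` is loaded are placeholders, and it requires parquet-avro and Hadoop on the classpath:

```java
// Sketch: supplying a requested projection schema to parquet-avro.
// Assumes parquet-avro/Hadoop on the classpath; paths are placeholders.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class ProjectionReadSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder: the projection schema shown above, parsed from a file.
    Schema projectionSchema = new Schema.Parser().parse(
        ProjectionReadSketch.class.getResourceAsStream("/projection.avsc"));

    Configuration conf = new Configuration();
    // Ask parquet-avro to decode only the projected columns.
    AvroReadSupport.setRequestedProjection(conf, projectionSchema);

    try (ParquetReader<GenericRecord> reader =
             AvroParquetReader.<GenericRecord>builder(
                     new Path("/path/to/part-00011"))
                 .withConf(conf)
                 .build()) {
      GenericRecord record;
      while ((record = reader.read()) != null) {
        System.out.println(record);
      }
    }
  }
}
```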
Thanks,
Yan