Hi Cheng, I believe that the patch (https://github.com/apache/parquet-mr/pull/264) should solve my problem.
Thanks for your help, :)

Cheers,
Yan

On Mon, Sep 21, 2015 at 1:54 PM, Cheng Lian <[email protected]> wrote:

> Could you please paste the schemas as a GitHub Gist? I couldn't get your
> idea since the message is pretty much garbled...
>
> And could you please also share the schema of the physical Parquet file
> you are trying to read? You may get it with the parquet-schema command
> line tool.
>
> Cheng
>
> On 9/21/15 1:36 PM, Yan Qi wrote:
>
>> Hi Cheng,
>>
>> The two projection schemas in the image are identical except for one
>> field: the left-hand schema has an extra "ts" field directly under
>> Nest2 (marked below), while the right-hand (modified) schema omits it:
>>
>> {
>>   "type": "record",
>>   "name": "Profile",
>>   "namespace": "avro.parquet.model",
>>   "fields": [
>>     {
>>       "name": "nest2",
>>       "type": [
>>         {
>>           "type": "array",
>>           "items": {
>>             "type": "record",
>>             "name": "Nest2",
>>             "fields": [
>>               { "name": "ts", "type": "int" },   <=== left-hand schema only
>>               {
>>                 "name": "nest1",
>>                 "type": {
>>                   "type": "record",
>>                   "name": "Nest1",
>>                   "fields": [
>>                     { "name": "mpi", "type": "int" },
>>                     { "name": "ts", "type": "int" }
>>                   ]
>>                 }
>>               }
>>             ]
>>           }
>>         }
>>       ]
>>     }
>>   ]
>> }
>>
>> I just realized that in my previous experiment, I made a mistake. Then I
>> reran the reading job with the modified schema on the right-hand side
>> (see above), and it succeeded.
>>
>> I am wondering how this can work.
>>
>> Thanks,
>> Yan
>>
>> On Mon, Sep 21, 2015 at 1:02 PM, Cheng Lian <[email protected]> wrote:
>>
>>> Also, could you please provide the full exception stack trace?
>>>
>>> Cheng
>>>
>>> On 9/21/15 1:01 PM, Cheng Lian wrote:
>>>
>>>> Hm...
>>>> unfortunately the inline image fails to show up :(
>>>>
>>>> On 9/21/15 12:06 PM, Yan Qi wrote:
>>>>
>>>>> Hi Cheng,
>>>>>
>>>>> I am now using parquet-mr 1.8.0, though I doubt it could solve the
>>>>> issue. I investigated further by trying the following query (see the
>>>>> image below):
>>>>>
>>>>> Inline image 1
>>>>>
>>>>> I added one more attribute to the projection schema, directly under
>>>>> Nest2. Surprisingly, the org.apache.parquet.io.InvalidRecordException
>>>>> is gone.
>>>>>
>>>>> However, there is another exception:
>>>>>
>>>>> org.apache.parquet.io.ParquetDecodingException: Can not read value at 0
>>>>> in block -1 in file file:/path/to/part-00011-avro-parquet-100
>>>>>   at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
>>>>>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:125)
>>>>>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:129)
>>>>>   ...
>>>>> Caused by: org.apache.parquet.io.ParquetDecodingException: The
>>>>> requested schema is not compatible with the file schema. incompatible
>>>>> types: required group nest2 (LIST) {
>>>>>   ...
>>>>>
>>>>> I am a little bit confused by the result. Could you help me understand
>>>>> this a bit?
>>>>>
>>>>> Thanks,
>>>>> Yan
>>>>>
>>>>> On Mon, Sep 21, 2015 at 10:53 AM, Cheng Lian <[email protected]> wrote:
>>>>>
>>>>> Sorry, missed the link to my 2nd test case:
>>>>>
>>>>> https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/test/scala/com/databricks/parquet/schema/ListCompatibilitySuite.scala#L137-L155
>>>>>
>>>>> Cheng
>>>>>
>>>>> On 9/21/15 10:52 AM, Cheng Lian wrote:
>>>>>
>>>>> I only checked the schema part while writing my last reply.
>>>>> However, after some more testing, I found that parquet-avro 1.7.0
>>>>> actually can decode your data correctly, and "isElementType" does
>>>>> return true because of this "if" branch:
>>>>>
>>>>> https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-avro/src/main/java/org/apache/parquet/avro/AvroIndexedRecordConverter.java#L518-L522
>>>>>
>>>>> May I know the version of parquet-avro/parquet-mr you were working
>>>>> with?
>>>>>
>>>>> Here are my tests (in Scala, with parquet-mr 1.7.0):
>>>>>
>>>>> - Case 1: Reading/writing a raw Parquet file containing an Avro array
>>>>>   of single-element records.
>>>>>   https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/test/scala/com/databricks/parquet/schema/ListCompatibilitySuite.scala#L103-L135
>>>>> - Case 2: The Profile example you provided. Avro definitions can be
>>>>>   found here:
>>>>>   https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/main/avro/parquet-avro-compat.avdl#L24-L35
>>>>>
>>>>> Is it possible that the Profile example schema you provided doesn't
>>>>> fully demonstrate the nature of your problematic data?
>>>>>
>>>>> Cheng
>>>>>
>>>>> On 9/21/15 10:04 AM, Yan Qi wrote:
>>>>>
>>>>> Hi Cheng,
>>>>>
>>>>> Thanks a lot for your help!
>>>>>
>>>>> A quick question: do we need to regenerate the data to reflect this
>>>>> change?
>>>>> Thanks,
>>>>> Yan
>>>>>
>>>>> On Sat, Sep 19, 2015 at 12:55 AM, Cheng Lian <[email protected]> wrote:
>>>>>
>>>>> To be more specific, the Parquet schema converted from the Avro
>>>>> schema is:
>>>>>
>>>>> message avro.parquet.model.Profile {
>>>>>   required group nest2 (LIST) {
>>>>>     repeated group array {
>>>>>       required group nest1 {
>>>>>         required int32 mpi;
>>>>>         required int32 ts;
>>>>>       }
>>>>>     }
>>>>>   }
>>>>> }
>>>>>
>>>>> The second-level "array" field should be viewed as the element type
>>>>> according to the following LIST backwards-compatibility rule:
>>>>>
>>>>>   If the repeated field is a group with one field and is named either
>>>>>   "array" or uses the LIST-annotated group's name with "_tuple"
>>>>>   appended, then the repeated type is the element type and elements
>>>>>   are required.
>>>>>
>>>>> However, this rule is only handled in AvroSchemaConverter but not in
>>>>> AvroIndexedRecordConverter. Thus, "array" is interpreted as the
>>>>> element wrapper group, and the element type of nest2 is considered
>>>>> to be
>>>>>
>>>>> required group nest1 {
>>>>>   required int32 mpi;
>>>>>   required int32 ts;
>>>>> }
>>>>>
>>>>> rather than
>>>>>
>>>>> required group Nest2 {
>>>>>   required group nest1 {
>>>>>     required int32 mpi;
>>>>>     required int32 ts;
>>>>>   }
>>>>> }
>>>>>
>>>>> This PR should probably fix your issue (although I still have some
>>>>> questions about the testing code):
>>>>> https://github.com/apache/parquet-mr/pull/264
>>>>>
>>>>> BTW, the recently released Spark SQL 1.5 should be able to read such
>>>>> Parquet files, but you need some extra work if you want to convert
>>>>> Spark SQL row objects to Avro records.
>>>>> This PR fixed this issue for Spark SQL:
>>>>> https://github.com/apache/spark/pull/8361
>>>>>
>>>>> Cheng
>>>>>
>>>>> On 9/19/15 12:40 AM, Cheng Lian wrote:
>>>>>
>>>>> You are hitting PARQUET-364:
>>>>> https://issues.apache.org/jira/browse/PARQUET-364
>>>>>
>>>>> Although the symptoms are different, they share the same root cause.
>>>>>
>>>>> Cheng
>>>>>
>>>>> On 9/18/15 6:13 PM, Yan Qi wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> The data I am working on has the following schema:
>>>>>
>>>>> {
>>>>>   "namespace": "avro.parquet.model",
>>>>>   "type": "record",
>>>>>   "name": "Profile",
>>>>>   "fields": [
>>>>>     {"name": "nest1", "type": [{"type": "array", "items": "Nest1"}, "null"]},
>>>>>     {"name": "nest2", "type": [{"type": "array", "items": "Nest2"}, "null"]},
>>>>>     {"name": "idd", "type": "long"}
>>>>>   ]
>>>>> }
>>>>>
>>>>> {
>>>>>   "namespace": "avro.parquet.model",
>>>>>   "type": "record",
>>>>>   "name": "Nest1",
>>>>>   "fields": [
>>>>>     {"name": "ts", "type": "int"},
>>>>>     {"name": "mpi", "type": "int"},
>>>>>     {"name": "api1", "type": ["int", "null"], "default": -1},
>>>>>     {"name": "api2", "type": ["long", "null"], "default": -1},
>>>>>     {"name": "api3", "type": ["int", "null"], "default": -1}
>>>>>   ]
>>>>> }
>>>>>
>>>>> {
>>>>>   "namespace": "avro.parquet.model",
>>>>>   "type": "record",
>>>>>   "name": "Nest2",
>>>>>   "fields": [
>>>>>     {"name": "nest1", "type": "Nest1"},
>>>>>     {"name": "ts", "type": "int"}
>>>>>   ]
>>>>> }
>>>>>
>>>>> Basically there are two nested tables, nest1 and nest2, in the
>>>>> Profile table. Nest2 has a member of type Nest1.
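[Editor's note: to make the nesting in the three Avro records above easier to see, here is a plain-Java mirror of them. These are hypothetical hand-written classes for illustration only, not the Avro-generated ones; the `["type", "null"]` unions are modeled as nullable boxed fields.]

```java
import java.util.List;

// Plain-Java mirror of the Avro records above, for illustration only.
class Nest1 {
    int ts;
    int mpi;
    Integer api1; // ["int", "null"]  -> optional
    Long api2;    // ["long", "null"] -> optional
    Integer api3; // ["int", "null"]  -> optional
}

class Nest2 {
    Nest1 nest1;  // Nest2 embeds a full Nest1 record
    int ts;
}

class Profile {
    List<Nest1> nest1; // optional array of Nest1
    List<Nest2> nest2; // optional array of Nest2 -- the nested "table" being projected
    long idd;
}
```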
>>>>> However, when I try to get access to the data with the following
>>>>> projection schema:
>>>>>
>>>>> {
>>>>>   "type": "record",
>>>>>   "name": "Profile",
>>>>>   "namespace": "avro.parquet.model",
>>>>>   "fields": [
>>>>>     {
>>>>>       "name": "nest2",
>>>>>       "type": [
>>>>>         {
>>>>>           "type": "array",
>>>>>           "items": {
>>>>>             "type": "record",
>>>>>             "name": "Nest2",
>>>>>             "fields": [
>>>>>               {
>>>>>                 "name": "nest1",
>>>>>                 "type": {
>>>>>                   "type": "record",
>>>>>                   "name": "Nest1",
>>>>>                   "fields": [
>>>>>                     { "name": "mpi", "type": "int" },
>>>>>                     { "name": "ts", "type": "int" }
>>>>>                   ]
>>>>>                 }
>>>>>               }
>>>>>             ]
>>>>>           }
>>>>>         }
>>>>>       ]
>>>>>     }
>>>>>   ]
>>>>> }
>>>>>
>>>>> I get the error message:
>>>>>
>>>>> org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema
>>>>> mismatch. Avro field 'mpi' not found.
>>>>>   at org.apache.parquet.avro.AvroIndexedRecordConverter.getAvroField(AvroIndexedRecordConverter.java:133)
>>>>>   ...
>>>>>
>>>>> When checking the code, I found that the function
>>>>> org.apache.parquet.avro.AvroIndexedRecordConverter.AvroArrayConverter.isElementType(Type, Schema)
>>>>> returns false, though these two parameters (i.e., type and schema)
>>>>> are compatible. I am wondering if it is a bug, or I missed something
>>>>> here.
>>>>>
>>>>> Thanks,
>>>>> Yan
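[Editor's note: the isElementType question at the end of the thread comes down to the LIST backwards-compatibility rule Cheng quotes earlier: a repeated group with exactly one field, named "array" (or "<listName>_tuple"), is itself the element type rather than a wrapper. Below is a minimal, self-contained sketch of that check; the class, method name, and signature are illustrative, not parquet-mr's actual API.]

```java
public class ListCompatRule {
    // Sketch of the LIST backwards-compatibility rule quoted in the thread:
    // the repeated group is itself the element type when it has exactly one
    // field and is named "array" or "<listName>_tuple". The real logic lives
    // in parquet-mr (handled by AvroSchemaConverter, but not, at the time of
    // this thread, by AvroIndexedRecordConverter).
    static boolean repeatedIsElementType(int fieldCount, String repeatedName, String listName) {
        return fieldCount == 1
            && (repeatedName.equals("array") || repeatedName.equals(listName + "_tuple"));
    }

    public static void main(String[] args) {
        // nest2's repeated group "array" has one field (nest1), so under the
        // rule it is the Nest2 element itself, not an element wrapper:
        System.out.println(repeatedIsElementType(1, "array", "nest2")); // true
        // A wrapper group with a different name does not match the rule:
        System.out.println(repeatedIsElementType(1, "list", "nest2"));  // false
    }
}
```

Because the writer schema's repeated group is named "array" and wraps a single field, the rule says the reader should treat it as the Nest2 element itself; the record converter failing to apply this rule is the root cause Cheng points to (PARQUET-364, fixed by parquet-mr PR #264).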
