Hi Cheng,
The projection schemas in the image are below. The left-hand schema contains the
extra "ts" field directly under Nest2 (the block marked with asterisks in the
image); the right-hand schema omits it.

Left-hand schema:

{
  "type": "record",
  "name": "Profile",
  "namespace": "avro.parquet.model",
  "fields": [
    {
      "name": "nest2",
      "type": [
        {
          "type": "array",
          "items": {
            "type": "record",
            "name": "Nest2",
            "fields": [
              {
                "name": "ts",
                "type": "int"
              },
              {
                "name": "nest1",
                "type": {
                  "type": "record",
                  "name": "Nest1",
                  "fields": [
                    {
                      "name": "mpi",
                      "type": "int"
                    },
                    {
                      "name": "ts",
                      "type": "int"
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  ]
}

Right-hand schema (identical except that the extra "ts" field under Nest2 is
removed):

{
  "type": "record",
  "name": "Profile",
  "namespace": "avro.parquet.model",
  "fields": [
    {
      "name": "nest2",
      "type": [
        {
          "type": "array",
          "items": {
            "type": "record",
            "name": "Nest2",
            "fields": [
              {
                "name": "nest1",
                "type": {
                  "type": "record",
                  "name": "Nest1",
                  "fields": [
                    {"name": "mpi", "type": "int"},
                    {"name": "ts", "type": "int"}
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  ]
}
I just realized that I made a mistake in my previous experiment. I then reran
the reading job with the modified schema on the right-hand side (see above),
and it succeeded.
I am wondering how this can work.
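Since Avro schemas are plain JSON, the difference between the two projections can be checked mechanically. A minimal sketch (the left-hand schema is transcribed from the image, so exact field order is an approximation; the right-hand one is derived by dropping the extra field):

```python
import copy
import json

# Left-hand projection schema, transcribed from the image: the extra "ts"
# field sits directly under Nest2, before nest1.
LEFT = """
{
  "type": "record", "name": "Profile", "namespace": "avro.parquet.model",
  "fields": [{
    "name": "nest2",
    "type": [{
      "type": "array",
      "items": {
        "type": "record", "name": "Nest2",
        "fields": [
          {"name": "ts", "type": "int"},
          {"name": "nest1", "type": {
            "type": "record", "name": "Nest1",
            "fields": [{"name": "mpi", "type": "int"},
                       {"name": "ts", "type": "int"}]}}
        ]
      }
    }]
  }]
}
"""

left = json.loads(LEFT)

# The right-hand projection is the same schema minus that extra "ts" field.
right = copy.deepcopy(left)
nest2 = right["fields"][0]["type"][0]["items"]  # union branch 0 -> array items
nest2["fields"] = [f for f in nest2["fields"] if f["name"] != "ts"]

def nest2_field_names(schema):
    # Profile -> fields[0] (nest2) -> union branch 0 (array) -> items (Nest2)
    return [f["name"] for f in schema["fields"][0]["type"][0]["items"]["fields"]]

print(nest2_field_names(left))   # ['ts', 'nest1']
print(nest2_field_names(right))  # ['nest1']
```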
Thanks,
Yan
On Mon, Sep 21, 2015 at 1:02 PM, Cheng Lian <[email protected]> wrote:
> Also, could you please provide the full exception stack trace?
>
> Cheng
>
>
> On 9/21/15 1:01 PM, Cheng Lian wrote:
>
>> Hm... unfortunately the inline image fails to show up :(
>>
>> On 9/21/15 12:06 PM, Yan Qi wrote:
>>
>>> Hi Cheng,
>>>
>>> I am now using parquet-mr 1.8.0, though I doubt it alone could solve the
>>> issue. I investigated further by trying the following query (see the
>>> image below):
>>>
>>> Inline image 1
>>>
>>> I added one more attribute in the projection schema, directly under
>>> Nest2. Surprisingly, the org.apache.parquet.io.InvalidRecordException
>>> is gone.
>>>
>>> However, there is another exception, as
>>>
>>> org.apache.parquet.io.ParquetDecodingException: Can not read value at 0
>>> in block -1 in file file:/path/to/part-00011-avro-parquet-100
>>> at
>>> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
>>> at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:125)
>>> at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:129)
>>> ...
>>> Caused by: org.apache.parquet.io.ParquetDecodingException: The
>>> requested schema is not compatible with the file schema. incompatible
>>> types: required group nest2 (LIST) {
>>> ...
>>>
>>> I am a little bit confused by the result. Could you help me understand this
>>> a bit?
>>>
>>> Thanks,
>>> Yan
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Sep 21, 2015 at 10:53 AM, Cheng Lian <[email protected]
>>> <mailto:[email protected]>> wrote:
>>>
>>> Sorry, missed link of my 2nd test case:
>>>
>>> https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/test/scala/com/databricks/parquet/schema/ListCompatibilitySuite.scala#L137-L155
>>>
>>> Cheng
>>>
>>> On 9/21/15 10:52 AM, Cheng Lian wrote:
>>>
>>> I only checked the schema part while writing my last reply.
>>> However, after some more testing, I found that actually
>>> parquet-avro 1.7.0 can correctly decode your data, and
>>> "isElementType" does return true because of this "if" branch
>>> <
>>> https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-avro/src/main/java/org/apache/parquet/avro/AvroIndexedRecordConverter.java#L518-L522
>>> >.
>>> May I know the version of parquet-avro/parquet-mr you were
>>> working with?
>>>
>>> Here are my tests (in Scala, with parquet-mr 1.7.0):
>>>
>>> - Case 1
>>> <
>>> https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/test/scala/com/databricks/parquet/schema/ListCompatibilitySuite.scala#L103-L135
>>> >:
>>>     Reading/writing a raw Parquet file containing an Avro array of
>>>     single-element records.
>>> - Case 2: The Profile example you provided. Avro definitions
>>> can be found here
>>> <
>>> https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/main/avro/parquet-avro-compat.avdl#L24-L35
>>> >.
>>>
>>>
>>>
>>> Is it possible that the Profile example schema you provided
>>> doesn't fully demonstrate the nature of your problematic data?
>>>
>>> Cheng
>>>
>>> On 9/21/15 10:04 AM, Yan Qi wrote:
>>>
>>> Hi Cheng,
>>>
>>> Thanks a lot for your help!
>>>
>>> A quick question: do we need to regenerate the data to
>>> reflect this change?
>>>
>>> Thanks,
>>> Yan
>>>
>>> On Sat, Sep 19, 2015 at 12:55 AM, Cheng
>>> Lian<[email protected]> wrote:
>>>
>>> To be more specific, the Parquet schema converted
>>> from the Avro schema is:
>>>
>>> message avro.parquet.model.Profile {
>>> required group nest2 (LIST) {
>>> repeated group array {
>>> required group nest1 {
>>> required int32 mpi;
>>> required int32 ts;
>>> }
>>> }
>>> }
>>> }
>>>
>>> The second level "array" field should be viewed as
>>> element type according
>>> to the following LIST backwards-compatibility rule:
>>>
>>>         If the repeated field is a group with one field and is named
>>>         either "array" or uses the LIST-annotated group's name with
>>>         "_tuple" appended, then the repeated type is the element type
>>>         and elements are required.
>>>
>>>         However, this rule is handled only in
>>>         AvroSchemaConverter, not in
>>>         AvroIndexedRecordConverter. Thus, "array" is
>>>         interpreted as the element
>>>         wrapper group, and the element type of nest2 is
>>>         considered to be
>>>
>>> required group nest1 {
>>> required int32 mpi;
>>> required int32 ts;
>>> }
>>>
>>> rather than
>>>
>>> required group Nest2 {
>>> required group nest1 {
>>> required int32 mpi;
>>> required int32 ts;
>>> }
>>> }
>>>
>>> This PR should probably fix your issue (although I
>>> still have some
>>> questions about the testing code)
>>> https://github.com/apache/parquet-mr/pull/264
>>>
>>>         BTW, the recently released Spark SQL 1.5 should be able
>>>         to read such Parquet
>>> files, but you need some extra work if you want to
>>> convert Spark SQL row
>>> objects to Avro records. This PR fixed this issue for
>>> Spark SQL:
>>> https://github.com/apache/spark/pull/8361
>>>
>>> Cheng
>>>
>>>
>>> On 9/19/15 12:40 AM, Cheng Lian wrote:
>>>
>>> You are hitting PARQUET-364
>>> https://issues.apache.org/jira/browse/PARQUET-364
>>>
>>> Although the symptoms are different, they share
>>> the same root cause.
>>>
>>> Cheng
>>>
>>> On 9/18/15 6:13 PM, Yan Qi wrote:
>>>
>>> Hi,
>>>
>>> The data I am working on have the following
>>> schema:
>>>
>>> {
>>> "namespace": "avro.parquet.model",
>>> "type": "record",
>>> "name": "Profile",
>>> "fields": [
>>> {"name": "nest1", "type": [{ "type": "array",
>>> "items": "Nest1"},
>>> "null"]},
>>> {"name": "nest2", "type": [{ "type": "array",
>>> "items": "Nest2"},
>>> "null"]},
>>>             {"name": "idd", "type": "long"}
>>>             ]
>>>             }
>>>
>>> {
>>> "namespace": "avro.parquet.model",
>>> "type": "record",
>>> "name": "Nest1",
>>> "fields": [
>>> {"name": "ts", "type": "int"},
>>> {"name": "mpi", "type": "int"},
>>> {"name": "api1", "type": ["int", "null"],
>>> "default": -1},
>>> {"name": "api2", "type": ["long", "null"],
>>> "default": -1},
>>>             {"name": "api3", "type": ["int", "null"],
>>>             "default": -1}
>>>             ]
>>>             }
>>>
>>> {
>>> "namespace": "avro.parquet.model",
>>> "type": "record",
>>> "name": "Nest2",
>>> "fields": [
>>> {"name": "nest1", "type": "Nest1"},
>>> {"name": "ts", "type": "int"} ]
>>> }
>>>
>>>             Basically there are two nested tables, nest1 and nest2,
>>>             in the Profile table. Nest2 has a member of type Nest1.
>>>             However, when I try to access the data with the following
>>>             projection schema:
>>>
>>> {
>>> "type": "record",
>>> "name": "Profile",
>>> "namespace": "avro.parquet.model",
>>> "fields": [
>>> {
>>> "name": "nest2",
>>> "type": [
>>> {
>>> "type": "array",
>>> "items": {
>>> "type": "record",
>>> "name": "Nest2",
>>> "fields": [
>>> {
>>> "name": "nest1",
>>> "type": {
>>> "type": "record",
>>> "name": "Nest1",
>>> "fields": [
>>> {
>>> "name": "mpi",
>>> "type": "int"
>>> },
>>> {
>>> "name": "ts",
>>> "type": "int"
>>> }
>>> ]
>>> }
>>> }
>>> ]
>>> }
>>> }
>>> ]
>>> }
>>> ]
>>> }
>>>
>>>             But I get the following error message:
>>>
>>>             org.apache.parquet.io.InvalidRecordException:
>>>             Parquet/Avro schema mismatch.
>>> Avro field 'mpi' not found.
>>> at
>>>
>>> org.apache.parquet.avro.AvroIndexedRecordConverter.getAvroField(AvroIndexedRecordConverter.java:133)
>>>
>>> ....
>>>
>>>
>>> When checking the code, I found that the
>>> function,
>>>
>>> org.apache.parquet.avro.AvroIndexedRecordConverter.AvroArrayConverter.isElementType(Type,
>>>
>>> Schema) returns false, though these two
>>> parameters (i.e., type and
>>> schema)
>>>             are compatible. I am wondering whether it is a
>>>             bug or whether I missed something
>>>             here.
>>>
>>> Thanks,
>>> Yan
>>>
>>>
>>>
>>>
>>>
>>>
>>
>