Could you please paste the schemas as a GitHub Gist? I couldn't quite follow your point since the message is pretty much garbled...

And could you please also share the schema of the physical Parquet file you are trying to read? You may get it with the parquet-schema command line tool.
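
If the command line tool isn't handy, something like the following also prints the file schema from the footer (just a sketch; the path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class PrintParquetSchema {
  public static void main(String[] args) throws Exception {
    // Read only the footer metadata and print the physical Parquet schema.
    Path file = new Path("/path/to/part-00011-avro-parquet-100");
    ParquetMetadata footer = ParquetFileReader.readFooter(new Configuration(), file);
    System.out.println(footer.getFileMetaData().getSchema());
  }
}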

Cheng

On 9/21/15 1:36 PM, Yan Qi wrote:
Hi Cheng,

The projection schemas in the image are below:

Left-hand schema (with an extra "ts" field directly under Nest2):

{
  "type": "record",
  "name": "Profile",
  "namespace": "avro.parquet.model",
  "fields": [
    {
      "name": "nest2",
      "type": [
        {
          "type": "array",
          "items": {
            "type": "record",
            "name": "Nest2",
            "fields": [
              {
                "name": "ts",
                "type": "int"
              },
              {
                "name": "nest1",
                "type": {
                  "type": "record",
                  "name": "Nest1",
                  "fields": [
                    {
                      "name": "mpi",
                      "type": "int"
                    },
                    {
                      "name": "ts",
                      "type": "int"
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  ]
}

Right-hand schema (the same, but without that extra "ts" field):

{
  "type": "record",
  "name": "Profile",
  "namespace": "avro.parquet.model",
  "fields": [
    {
      "name": "nest2",
      "type": [
        {
          "type": "array",
          "items": {
            "type": "record",
            "name": "Nest2",
            "fields": [
              {
                "name": "nest1",
                "type": {
                  "type": "record",
                  "name": "Nest1",
                  "fields": [
                    {
                      "name": "mpi",
                      "type": "int"
                    },
                    {
                      "name": "ts",
                      "type": "int"
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  ]
}



I just realized that in my previous experiment, I made a mistake. Then I
reran the reading job with the modified schema on the right hand side (see
above), and it succeeded.

I am wondering how this can work.
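
For reference, the projection is applied along these lines (a minimal sketch rather than the exact job; the schema file and input path are placeholders):

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class ReadWithProjection {
  public static void main(String[] args) throws Exception {
    // Parse the projection schema ("projection.avsc" is a placeholder file).
    Schema projection = new Schema.Parser().parse(new File("projection.avsc"));

    // Register the projection so parquet-avro only materializes the requested fields.
    Configuration conf = new Configuration();
    AvroReadSupport.setRequestedProjection(conf, projection);

    // Read the Parquet file with the projected Avro schema.
    ParquetReader<GenericRecord> reader = AvroParquetReader
        .<GenericRecord>builder(new Path("/path/to/part-00011-avro-parquet-100"))
        .withConf(conf)
        .build();

    GenericRecord record;
    while ((record = reader.read()) != null) {
      System.out.println(record);
    }
    reader.close();
  }
}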

Thanks,
Yan



On Mon, Sep 21, 2015 at 1:02 PM, Cheng Lian <[email protected]> wrote:

Also, could you please provide the full exception stack trace?

Cheng


On 9/21/15 1:01 PM, Cheng Lian wrote:

Hm... unfortunately the inline image fails to show up :(

On 9/21/15 12:06 PM, Yan Qi wrote:

Hi Cheng,

I am now using parquet-mr 1.8.0, though I doubt it alone could solve the
issue. I investigated further by trying the following query (see the
image below):

[Inline image 1]

I added one more attribute in the projection schema, which is directly
under Nest2. Surprisingly, the org.apache.parquet.io.InvalidRecordException
is gone.

However, there is another exception:

org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/path/to/part-00011-avro-parquet-100
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:125)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:129)
...
Caused by: org.apache.parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: required group nest2 (LIST) {
...

I am a little bit confused by the result. Could you help me understand this
a bit?

Thanks,
Yan





On Mon, Sep 21, 2015 at 10:53 AM, Cheng Lian <[email protected]> wrote:

      Sorry, I missed the link to my 2nd test case:

https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/test/scala/com/databricks/parquet/schema/ListCompatibilitySuite.scala#L137-L155

     Cheng

     On 9/21/15 10:52 AM, Cheng Lian wrote:

         I only checked the schema part while writing my last reply.
         However, after some more testing, I found that actually
         parquet-avro 1.7.0 can correctly decode your data, and
         "isElementType" does return true because of this "if" branch
          <https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-avro/src/main/java/org/apache/parquet/avro/AvroIndexedRecordConverter.java#L518-L522>.
         May I know the version of parquet-avro/parquet-mr you were
         working with?
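
          Roughly paraphrased (this is a sketch of the heuristic, not the
          actual parquet-avro source; see the link above), that branch treats
          the repeated group itself as the element when the Avro element
          record has a single field whose name matches the repeated group's
          only field:

          import org.apache.avro.Schema;
          import org.apache.parquet.schema.GroupType;
          import org.apache.parquet.schema.Type;

          class ElementTypeHeuristic {
            // Rough paraphrase of the isElementType checks.
            static boolean looksLikeElementType(Type repeatedType, Schema elementSchema) {
              // A primitive repeated field can only be the element itself.
              if (repeatedType.isPrimitive()) {
                return true;
              }
              GroupType group = repeatedType.asGroupType();
              // A synthetic element wrapper always has exactly one field.
              if (group.getFieldCount() > 1) {
                return true;
              }
              // Single-field Avro element records whose field name matches the
              // repeated group's only field are treated as the element type.
              return elementSchema != null
                  && elementSchema.getType() == Schema.Type.RECORD
                  && elementSchema.getFields().size() == 1
                  && elementSchema.getFields().get(0).name().equals(group.getFieldName(0));
            }
          }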

         Here are my tests (in Scala, with parquet-mr 1.7.0):

         - Case 1
          <https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/test/scala/com/databricks/parquet/schema/ListCompatibilitySuite.scala#L103-L135>:
          Reading/writing a raw Parquet file containing an Avro array of
          single-element records.
         - Case 2: The Profile example you provided. Avro definitions
         can be found here
          <https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/main/avro/parquet-avro-compat.avdl#L24-L35>.


         Is it possible that the Profile example schema you provided
         doesn't fully demonstrate the nature of your problematic data?

         Cheng

         On 9/21/15 10:04 AM, Yan Qi wrote:

             Hi Cheng,

             Thanks a lot for your help!

             A quick question: do we need to regenerate the data to
             reflect this change?

             Thanks,
             Yan

             On Sat, Sep 19, 2015 at 12:55 AM, Cheng
             Lian<[email protected]> wrote:

                 To be more specific, the Parquet schema converted
                 from the Avro schema is:

                 message avro.parquet.model.Profile {
                    required group nest2 (LIST) {
                      repeated group array {
                        required group nest1 {
                          required int32 mpi;
                          required int32 ts;
                        }
                      }
                    }
                 }

                  The second-level "array" field should be viewed as
                  the element type according
                 to the following LIST backwards-compatibility rule:

                      If the repeated field is a group with one field and is
                      named either "array" or uses the LIST-annotated group's
                      name with "_tuple" appended, then the repeated type is
                      the element type and elements are required.

                 However, this rule is only handled in
                 AvroSchemaConverter but not
                 AvroIndexedRecordConverter. Thus, "array" is
                 interpreted as the element
                  wrapper group, and the element type of nest2 is
                  considered to be

                 required group nest1 {
                    required int32 mpi;
                    required int32 ts;
                 }

                 rather than

                 required group Nest2 {
                    required group nest1 {
                      required int32 mpi;
                      required int32 ts;
                    }
                 }
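
                  For what it's worth, the converted schema above can be
                  reproduced with AvroSchemaConverter (a small sketch;
                  "profile.avsc" is a placeholder for wherever the Profile
                  schema lives):

                  import java.io.File;
                  import org.apache.avro.Schema;
                  import org.apache.parquet.avro.AvroSchemaConverter;
                  import org.apache.parquet.schema.MessageType;

                  public class PrintConvertedSchema {
                    public static void main(String[] args) throws Exception {
                      // Parse the Avro Profile schema ("profile.avsc" is a placeholder).
                      Schema avroSchema = new Schema.Parser().parse(new File("profile.avsc"));
                      // Convert it the same way parquet-avro does when writing.
                      MessageType parquetSchema = new AvroSchemaConverter().convert(avroSchema);
                      System.out.println(parquetSchema);
                    }
                  }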

                 This PR should probably fix your issue (although I
                 still have some
                 questions about the testing code)
                 https://github.com/apache/parquet-mr/pull/264

                  BTW, the recently released Spark SQL 1.5 should be able
                  to read such Parquet
                 files, but you need some extra work if you want to
                 convert Spark SQL row
                 objects to Avro records. This PR fixed this issue for
                 Spark SQL:
                 https://github.com/apache/spark/pull/8361
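
                  Reading such files from Spark SQL 1.5 looks roughly like
                  this (a sketch using the Java API; the input path is a
                  placeholder):

                  import org.apache.spark.SparkConf;
                  import org.apache.spark.api.java.JavaSparkContext;
                  import org.apache.spark.sql.DataFrame;
                  import org.apache.spark.sql.SQLContext;

                  public class ReadNestedParquet {
                    public static void main(String[] args) {
                      SparkConf conf = new SparkConf().setAppName("ReadNestedParquet");
                      JavaSparkContext sc = new JavaSparkContext(conf);
                      SQLContext sqlContext = new SQLContext(sc);

                      // Spark SQL applies the LIST backwards-compatibility rules while
                      // reading, so the nested arrays come back as proper lists.
                      DataFrame df = sqlContext.read().parquet("/path/to/parquet-output");
                      df.printSchema();
                      df.show();

                      sc.stop();
                    }
                  }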

                 Cheng


                 On 9/19/15 12:40 AM, Cheng Lian wrote:

                     You are hitting PARQUET-364
                     https://issues.apache.org/jira/browse/PARQUET-364

                     Although the symptoms are different, they share
                     the same root cause.

                     Cheng

                     On 9/18/15 6:13 PM, Yan Qi wrote:

                         Hi,

                         The data I am working on have the following
                          schemas:

                          {
                            "namespace": "avro.parquet.model",
                            "type": "record",
                            "name": "Profile",
                            "fields": [
                              {"name": "nest1", "type": [{"type": "array", "items": "Nest1"}, "null"]},
                              {"name": "nest2", "type": [{"type": "array", "items": "Nest2"}, "null"]},
                              {"name": "idd", "type": "long"}
                            ]
                          }

                          {
                            "namespace": "avro.parquet.model",
                            "type": "record",
                            "name": "Nest1",
                            "fields": [
                              {"name": "ts", "type": "int"},
                              {"name": "mpi", "type": "int"},
                              {"name": "api1", "type": ["int", "null"], "default": -1},
                              {"name": "api2", "type": ["long", "null"], "default": -1},
                              {"name": "api3", "type": ["int", "null"], "default": -1}
                            ]
                          }

                          {
                            "namespace": "avro.parquet.model",
                            "type": "record",
                            "name": "Nest2",
                            "fields": [
                              {"name": "nest1", "type": "Nest1"},
                              {"name": "ts", "type": "int"}
                            ]
                          }

                          Basically there are two nested tables, nest1
                          and nest2, in the Profile table. Nest2 has a
                          member of type Nest1. However, when I try to
                          access the data with the following projection
                          schema:

                         {
                             "type": "record",
                             "name": "Profile",
                             "namespace": "avro.parquet.model",
                             "fields": [
                               {
                                 "name": "nest2",
                                 "type": [
                                   {
                                     "type": "array",
                                     "items": {
                                       "type": "record",
                                       "name": "Nest2",
                                       "fields": [
                                         {
                                           "name": "nest1",
                                           "type": {
                                             "type": "record",
                                             "name": "Nest1",
                                             "fields": [
                                               {
                                                 "name": "mpi",
                                                 "type": "int"
                                               },
                                               {
                                                 "name": "ts",
                                                 "type": "int"
                                               }
                                             ]
                                           }
                                         }
                                       ]
                                     }
                                   }
                                 ]
                               }
                             ]
                         }

                          I get the following error message:

                          org.apache.parquet.io.InvalidRecordException:
                          Parquet/Avro schema mismatch.
                          Avro field 'mpi' not found.
                          at org.apache.parquet.avro.AvroIndexedRecordConverter.getAvroField(AvroIndexedRecordConverter.java:133)
                          ....


                          When checking the code, I found that the function
                          org.apache.parquet.avro.AvroIndexedRecordConverter.AvroArrayConverter.isElementType(Type, Schema)
                          returns false, even though these two parameters
                          (i.e., the type and the schema) are compatible. I am
                          wondering whether this is a bug or whether I missed
                          something here.

                         Thanks,
                         Yan






