To be more specific, the Parquet schema converted from the Avro schema is:

message avro.parquet.model.Profile {
  required group nest2 (LIST) {
    repeated group array {
      required group nest1 {
        required int32 mpi;
        required int32 ts;
      }
    }
  }
}
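
For reference, you can reproduce this conversion with AvroSchemaConverter; a minimal Java sketch (the .avsc file name is just a placeholder for the Avro schema you want to convert):

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.schema.MessageType;

public class PrintConvertedSchema {
  public static void main(String[] args) throws IOException {
    // Parse the Avro schema (placeholder file name)
    Schema avroSchema = new Schema.Parser().parse(new File("profile.avsc"));

    // Convert it to a Parquet message type the same way parquet-avro does
    MessageType parquetSchema = new AvroSchemaConverter().convert(avroSchema);
    System.out.println(parquetSchema);
  }
}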

The second-level "array" field should be viewed as the element type according to the following LIST backwards-compatibility rule:

   If the repeated field is a group with one field and is named either
   "array" or uses the LIST-annotated group's name with "_tuple"
   appended then the repeated type is the element type and elements are
   required.

However, this rule is only handled in AvroSchemaConverter, not in AvroIndexedRecordConverter. Thus, "array" is interpreted as the element wrapper group, and the element type of nest2 is taken to be

required group nest1 {
  required int32 mpi;
  required int32 ts;
}

rather than

required group Nest2 {
  required group nest1 {
    required int32 mpi;
    required int32 ts;
  }
}
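
To make the rule concrete, here is a rough sketch of the check in Java. This is not the actual parquet-mr code (the real method takes the Avro element schema rather than the list group's name); it is just the quoted rule spelled out:

import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.Type;

// Sketch of the LIST backwards-compatibility rule quoted above, NOT the
// actual parquet-mr implementation. Returns true when the repeated field
// inside a LIST-annotated group is itself the element type, and false when
// it is only a wrapper around the element.
class ListCompatRuleSketch {
  static boolean isElementType(Type repeatedType, String listGroupName) {
    if (repeatedType.isPrimitive()) {
      return true;  // a repeated primitive is the element itself
    }
    GroupType group = repeatedType.asGroupType();
    if (group.getFieldCount() > 1) {
      return true;  // a group with multiple fields is the element itself
    }
    // A single-field group named "array" or "<list name>_tuple" is still
    // the element type -- this is exactly the "array" case above.
    return group.getName().equals("array")
        || group.getName().equals(listGroupName + "_tuple");
  }
}

AvroSchemaConverter follows this rule, while AvroIndexedRecordConverter effectively treats the single-field group as a wrapper in all cases, which is where the mismatch comes from.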

This PR should probably fix your issue (although I still have some questions about the testing code): https://github.com/apache/parquet-mr/pull/264

BTW, the recently released Spark SQL 1.5 should be able to read such Parquet files, but you need some extra work if you want to convert Spark SQL row objects to Avro records. This PR fixed the issue for Spark SQL: https://github.com/apache/spark/pull/8361
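
Reading the file back with Spark SQL looks roughly like this (a minimal Java sketch; the path and app name are placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class ReadWithSparkSql {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("read-nested-parquet").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(sc);

    // Spark SQL 1.5 should be able to read files with this legacy two-level
    // LIST layout; the nested arrays come back as nested rows.
    DataFrame profiles = sqlContext.read().parquet("/path/to/profile.parquet");
    profiles.printSchema();
    profiles.show();

    sc.stop();
  }
}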

Cheng

On 9/19/15 12:40 AM, Cheng Lian wrote:
You are hitting PARQUET-364: https://issues.apache.org/jira/browse/PARQUET-364

Although the symptoms are different, they share the same root cause.

Cheng

On 9/18/15 6:13 PM, Yan Qi wrote:
Hi,

The data I am working on has the following schema:

{
  "namespace": "avro.parquet.model",
  "type": "record",
  "name": "Profile",
  "fields": [
    {"name": "nest1", "type": [{"type": "array", "items": "Nest1"}, "null"]},
    {"name": "nest2", "type": [{"type": "array", "items": "Nest2"}, "null"]},
    {"name": "idd", "type": "long"}
  ]
}

{
  "namespace": "avro.parquet.model",
  "type": "record",
  "name": "Nest1",
  "fields": [
    {"name": "ts", "type": "int"},
    {"name": "mpi", "type": "int"},
    {"name": "api1", "type": ["int", "null"], "default": -1},
    {"name": "api2", "type": ["long", "null"], "default": -1},
    {"name": "api3", "type": ["int", "null"], "default": -1}
  ]
}

{
  "namespace": "avro.parquet.model",
  "type": "record",
  "name": "Nest2",
  "fields": [
    {"name": "nest1", "type": "Nest1"},
    {"name": "ts", "type": "int"}
  ]
}

Basically, there are two nested tables, nest1 and nest2, in the table Profile. Nest2 has a member of type Nest1. However, when I try to access the data with the following projection schema:

{
   "type": "record",
   "name": "Profile",
   "namespace": "avro.parquet.model",
   "fields": [
     {
       "name": "nest2",
       "type": [
         {
           "type": "array",
           "items": {
             "type": "record",
             "name": "Nest2",
             "fields": [
               {
                 "name": "nest1",
                 "type": {
                   "type": "record",
                   "name": "Nest1",
                   "fields": [
                     {
                       "name": "mpi",
                       "type": "int"
                     },
                     {
                       "name": "ts",
                       "type": "int"
                     }
                   ]
                 }
               }
             ]
           }
         }
       ]
     }
   ]
}

I get the following error message:

org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch. Avro field 'mpi' not found.
    at org.apache.parquet.avro.AvroIndexedRecordConverter.getAvroField(AvroIndexedRecordConverter.java:133)
    ....


When checking the code, I found that the method
org.apache.parquet.avro.AvroIndexedRecordConverter.AvroArrayConverter.isElementType(Type, Schema) returns false, even though the two parameters (i.e., the Parquet type and the Avro schema) are compatible. I am wondering whether this is a bug or whether I have missed something here.
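
For reference, the way I read the data with the projection is roughly the following (a sketch; paths and file names are placeholders):

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class ReadWithProjection {
  public static void main(String[] args) throws IOException {
    // The projection schema shown above (placeholder file name)
    Schema projection = new Schema.Parser().parse(new File("profile-projection.avsc"));

    // Ask parquet-avro to read only the projected columns
    Configuration conf = new Configuration();
    AvroReadSupport.setRequestedProjection(conf, projection);

    ParquetReader<GenericRecord> reader = AvroParquetReader
        .<GenericRecord>builder(new Path("/path/to/profile.parquet"))
        .withConf(conf)
        .build();

    GenericRecord record;
    while ((record = reader.read()) != null) {
      System.out.println(record);
    }
    reader.close();
  }
}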

Thanks,
Yan


