To be more specific, the Parquet schema converted from the Avro schema is:

message avro.parquet.model.Profile {
  required group nest2 (LIST) {
    repeated group array {
      required group nest1 {
        required int32 mpi;
        required int32 ts;
      }
    }
  }
}
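
For reference, you can reproduce this conversion with AvroSchemaConverter; a minimal Java sketch (the .avsc file name is just a placeholder for the Avro schema you want to convert):

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.schema.MessageType;

public class PrintConvertedSchema {
  public static void main(String[] args) throws IOException {
    // Parse the Avro schema (placeholder file name)
    Schema avroSchema = new Schema.Parser().parse(new File("profile.avsc"));

    // Convert it to a Parquet message type the same way parquet-avro does
    MessageType parquetSchema = new AvroSchemaConverter().convert(avroSchema);
    System.out.println(parquetSchema);
  }
}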

The second-level "array" field should be viewed as the element type according to the following LIST backwards-compatibility rule:

   If the repeated field is a group with one field and is named either
   "array" or uses the LIST-annotated group's name with "_tuple"
   appended then the repeated type is the element type and elements are
   required.

However, this rule is only handled in AvroSchemaConverter, not in AvroIndexedRecordConverter. Thus, "array" is interpreted as the element wrapper group, and the element type of nest2 is taken to be

required group nest1 {
  required int32 mpi;
  required int32 ts;
}

rather than

required group Nest2 {
  required group nest1 {
    required int32 mpi;
    required int32 ts;
  }
}
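
To make the rule concrete, here is a rough sketch of the check in Java. This is not the actual parquet-mr code (the real method takes the Avro element schema rather than the list group's name); it is just the quoted rule spelled out:

import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.Type;

// Sketch of the LIST backwards-compatibility rule quoted above, NOT the
// actual parquet-mr implementation. Returns true when the repeated field
// inside a LIST-annotated group is itself the element type, and false when
// it is only a wrapper around the element.
class ListCompatRuleSketch {
  static boolean isElementType(Type repeatedType, String listGroupName) {
    if (repeatedType.isPrimitive()) {
      return true;  // a repeated primitive is the element itself
    }
    GroupType group = repeatedType.asGroupType();
    if (group.getFieldCount() > 1) {
      return true;  // a group with multiple fields is the element itself
    }
    // A single-field group named "array" or "<list name>_tuple" is still
    // the element type -- this is exactly the "array" case above.
    return group.getName().equals("array")
        || group.getName().equals(listGroupName + "_tuple");
  }
}

AvroSchemaConverter follows this rule, while AvroIndexedRecordConverter effectively treats the single-field group as a wrapper in all cases, which is where the mismatch comes from.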

This PR should probably fix your issue (although I still have some questions about the testing code): https://github.com/apache/parquet-mr/pull/264

BTW, the recently released Spark SQL 1.5 should be able to read such Parquet files, but you need some extra work if you want to convert Spark SQL row objects to Avro records. This PR fixed the issue for Spark SQL: https://github.com/apache/spark/pull/8361
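
Reading the file back with Spark SQL looks roughly like this (a minimal Java sketch; the path and app name are placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class ReadWithSparkSql {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("read-nested-parquet").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(sc);

    // Spark SQL 1.5 should be able to read files with this legacy two-level
    // LIST layout; the nested arrays come back as nested rows.
    DataFrame profiles = sqlContext.read().parquet("/path/to/profile.parquet");
    profiles.printSchema();
    profiles.show();

    sc.stop();
  }
}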

Cheng

On 9/19/15 12:40 AM, Cheng Lian wrote:
You are hitting PARQUET-364: https://issues.apache.org/jira/browse/PARQUET-364

Although the symptoms are different, they share the same root cause.

Cheng

On 9/18/15 6:13 PM, Yan Qi wrote:
Hi,

The data I am working on has the following schema:

{
  "namespace": "avro.parquet.model",
  "type": "record",
  "name": "Profile",
  "fields": [
    {"name": "nest1", "type": [{"type": "array", "items": "Nest1"}, "null"]},
    {"name": "nest2", "type": [{"type": "array", "items": "Nest2"}, "null"]},
    {"name": "idd", "type": "long"}
  ]
}

{
  "namespace": "avro.parquet.model",
  "type": "record",
  "name": "Nest1",
  "fields": [
    {"name": "ts", "type": "int"},
    {"name": "mpi", "type": "int"},
    {"name": "api1", "type": ["int", "null"], "default": -1},
    {"name": "api2", "type": ["long", "null"], "default": -1},
    {"name": "api3", "type": ["int", "null"], "default": -1}
  ]
}

{
  "namespace": "avro.parquet.model",
  "type": "record",
  "name": "Nest2",
  "fields": [
    {"name": "nest1", "type": "Nest1"},
    {"name": "ts", "type": "int"}
  ]
}

Basically, there are two nested tables, nest1 and nest2, in the table Profile. Nest2 has a member of type Nest1. However, when I try to access the data with the following projection schema:

{
   "type": "record",
   "name": "Profile",
   "namespace": "avro.parquet.model",
   "fields": [
     {
       "name": "nest2",
       "type": [
         {
           "type": "array",
           "items": {
             "type": "record",
             "name": "Nest2",
             "fields": [
               {
                 "name": "nest1",
                 "type": {
                   "type": "record",
                   "name": "Nest1",
                   "fields": [
                     {
                       "name": "mpi",
                       "type": "int"
                     },
                     {
                       "name": "ts",
                       "type": "int"
                     }
                   ]
                 }
               }
             ]
           }
         }
       ]
     }
   ]
}

I get the following error message:

org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch. Avro field 'mpi' not found.
    at org.apache.parquet.avro.AvroIndexedRecordConverter.getAvroField(AvroIndexedRecordConverter.java:133)
    ....


When checking the code, I found that the method
org.apache.parquet.avro.AvroIndexedRecordConverter.AvroArrayConverter.isElementType(Type, Schema) returns false, even though the two parameters (i.e., the Parquet type and the Avro schema) are compatible. I am wondering whether this is a bug or whether I have missed something here.
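
For reference, the way I read the data with the projection is roughly the following (a sketch; paths and file names are placeholders):

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class ReadWithProjection {
  public static void main(String[] args) throws IOException {
    // The projection schema shown above (placeholder file name)
    Schema projection = new Schema.Parser().parse(new File("profile-projection.avsc"));

    // Ask parquet-avro to read only the projected columns
    Configuration conf = new Configuration();
    AvroReadSupport.setRequestedProjection(conf, projection);

    ParquetReader<GenericRecord> reader = AvroParquetReader
        .<GenericRecord>builder(new Path("/path/to/profile.parquet"))
        .withConf(conf)
        .build();

    GenericRecord record;
    while ((record = reader.read()) != null) {
      System.out.println(record);
    }
    reader.close();
  }
}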

Thanks,
Yan


