Sorry, I missed the link to my 2nd test case: https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/test/scala/com/databricks/parquet/schema/ListCompatibilitySuite.scala#L137-L155

Cheng

On 9/21/15 10:52 AM, Cheng Lian wrote:
I only checked the schema part while writing my last reply. However, after some more testing, I found that actually parquet-avro 1.7.0 can correctly decode your data, and "isElementType" does return true because of this "if" branch <https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-avro/src/main/java/org/apache/parquet/avro/AvroIndexedRecordConverter.java#L518-L522>. May I know the version of parquet-avro/parquet-mr you were working with?
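
For context, a rough paraphrase of what that branch appears to check (a hypothetical, simplified Python sketch, not the actual Java source; the flattened inputs are assumptions made for illustration): the requested Avro element is treated as matching the repeated group when it is a record whose single field name equals the repeated group's only field name.

```python
def matches_single_field_record(avro_element_type, avro_element_field_names,
                                repeated_group_field_names):
    """Rough paraphrase (not the Java source) of the special case in
    AvroIndexedRecordConverter.isElementType: if the requested Avro element
    is a record with exactly one field whose name equals the repeated
    group's only field, the repeated group itself is taken as the element
    type."""
    return (avro_element_type == "record"
            and len(avro_element_field_names) == 1
            and len(repeated_group_field_names) == 1
            and avro_element_field_names[0] == repeated_group_field_names[0])

# For the Profile example: the Avro element Nest2 has one field "nest1", and
# the repeated group "array" has one field "nest1" -> the check passes.
assert matches_single_field_record("record", ["nest1"], ["nest1"])
assert not matches_single_field_record("record", ["mpi", "ts"], ["nest1"])
```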

Here are my tests (in Scala, with parquet-mr 1.7.0):

- Case 1 <https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/test/scala/com/databricks/parquet/schema/ListCompatibilitySuite.scala#L103-L135>: reading and writing a raw Parquet file containing an Avro array of single-element records.
- Case 2: the Profile example you provided. The Avro definitions can be found here <https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/main/avro/parquet-avro-compat.avdl#L24-L35>.

Is it possible that the Profile example schema you provided doesn't fully demonstrate the nature of your problematic data?

Cheng

On 9/21/15 10:04 AM, Yan Qi wrote:
Hi Cheng,

Thanks a lot for your help!

A quick question: do we need to regenerate the data to reflect this change?

Thanks,
Yan

On Sat, Sep 19, 2015 at 12:55 AM, Cheng Lian<[email protected]>  wrote:

To be more specific, the Parquet schema converted from the Avro schema is:

message avro.parquet.model.Profile {
   required group nest2 (LIST) {
     repeated group array {
       required group nest1 {
         required int32 mpi;
         required int32 ts;
       }
     }
   }
}

The second-level "array" field should be viewed as the element type according
to the following LIST backwards-compatibility rule:

    If the repeated field is a group with one field and is named either
    "array" or uses the LIST-annotated group's name with "_tuple" appended,
    then the repeated type is the element type and elements are required.
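
This rule (together with the standard cases for repeated primitives and multi-field groups) can be sketched as a small predicate. This is a toy Python illustration under assumed, flattened inputs, not the parquet-mr implementation:

```python
def repeated_is_element_type(repeated_is_group, field_count,
                             repeated_name, list_name):
    """Backwards-compatibility check for 2-level LIST structures.

    Returns True when the repeated field itself should be treated as the
    element type (so its children are element fields), rather than as a
    synthetic wrapper group around a single element field.
    """
    if not repeated_is_group:
        return True   # a repeated primitive is always the element itself
    if field_count != 1:
        return True   # a multi-field group cannot be a synthetic wrapper
    # One-field group: fall back to the naming convention quoted above.
    return repeated_name == "array" or repeated_name == list_name + "_tuple"

# For the Profile schema: repeated group "array" with one field under the
# LIST-annotated group "nest2" -> "array" is the element type.
assert repeated_is_element_type(True, 1, "array", "nest2")
```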

However, this rule is handled in AvroSchemaConverter but not in
AvroIndexedRecordConverter. As a result, "array" is interpreted as a
synthetic element wrapper group, and the element type of nest2 is taken to be

required group nest1 {
   required int32 mpi;
   required int32 ts;
}

rather than

required group Nest2 {
   required group nest1 {
     required int32 mpi;
     required int32 ts;
   }
}
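
The consequence of the wrong interpretation can be illustrated with a toy field lookup (plain Python dicts standing in for schemas; this is an illustration, not parquet-mr code). Under the wrong interpretation, the Parquet element exposes the fields mpi and ts, while the Avro element record Nest2 only has the field nest1, so the per-field lookup fails, matching the "Avro field 'mpi' not found" error reported below:

```python
# Avro projection: element record Nest2 has a single field "nest1".
avro_element_fields = {"nest1"}

# Parquet element fields if "array" is (wrongly) treated as a wrapper:
# the element becomes "nest1" itself, whose fields are mpi and ts.
parquet_element_fields_wrong = ["mpi", "ts"]

# Parquet element fields if "array" is (correctly) the element type:
parquet_element_fields_right = ["nest1"]

def match(parquet_fields, avro_fields):
    """Mimic the converter's per-field lookup of Parquet names in the
    requested Avro record."""
    for name in parquet_fields:
        if name not in avro_fields:
            raise LookupError("Avro field '%s' not found" % name)

match(parquet_element_fields_right, avro_element_fields)  # succeeds

try:
    match(parquet_element_fields_wrong, avro_element_fields)
except LookupError as e:
    print(e)  # Avro field 'mpi' not found
```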

This PR should probably fix your issue (although I still have some
questions about the testing code)
https://github.com/apache/parquet-mr/pull/264

BTW, recently released Spark SQL 1.5 should be able to read such Parquet
files, but you need some extra work if you want to convert Spark SQL row
objects to Avro records. This PR fixed this issue for Spark SQL:
https://github.com/apache/spark/pull/8361

Cheng


On 9/19/15 12:40 AM, Cheng Lian wrote:

You are hitting PARQUET-364
https://issues.apache.org/jira/browse/PARQUET-364

Although the symptoms are different, they share the same root cause.

Cheng

On 9/18/15 6:13 PM, Yan Qi wrote:

Hi,

The data I am working on have the following schema:

{
  "namespace": "avro.parquet.model",
  "type": "record",
  "name": "Profile",
  "fields": [
    {"name": "nest1", "type": [{"type": "array", "items": "Nest1"}, "null"]},
    {"name": "nest2", "type": [{"type": "array", "items": "Nest2"}, "null"]},
    {"name": "idd", "type": "long"}
  ]
}

{
  "namespace": "avro.parquet.model",
  "type": "record",
  "name": "Nest1",
  "fields": [
    {"name": "ts", "type": "int"},
    {"name": "mpi", "type": "int"},
    {"name": "api1", "type": ["int", "null"], "default": -1},
    {"name": "api2", "type": ["long", "null"], "default": -1},
    {"name": "api3", "type": ["int", "null"], "default": -1}
  ]
}

{
  "namespace": "avro.parquet.model",
  "type": "record",
  "name": "Nest2",
  "fields": [
    {"name": "nest1", "type": "Nest1"},
    {"name": "ts", "type": "int"}
  ]
}

Basically, there are two nested tables, nest1 and nest2, in the Profile
table. Nest2 has a member of type Nest1. However, when I try to access
the data with the following projection schema:

{
    "type": "record",
    "name": "Profile",
    "namespace": "avro.parquet.model",
    "fields": [
      {
        "name": "nest2",
        "type": [
          {
            "type": "array",
            "items": {
              "type": "record",
              "name": "Nest2",
              "fields": [
                {
                  "name": "nest1",
                  "type": {
                    "type": "record",
                    "name": "Nest1",
                    "fields": [
                      {
                        "name": "mpi",
                        "type": "int"
                      },
                      {
                        "name": "ts",
                        "type": "int"
                      }
                    ]
                  }
                }
              ]
            }
          }
        ]
      }
    ]
}

I get the following error:

org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch:
Avro field 'mpi' not found.
at org.apache.parquet.avro.AvroIndexedRecordConverter.getAvroField(AvroIndexedRecordConverter.java:133)

....


When checking the code, I found that
org.apache.parquet.avro.AvroIndexedRecordConverter.AvroArrayConverter.isElementType(Type, Schema)
returns false, even though the two parameters (i.e., the Parquet type and
the Avro schema) are compatible. I am wondering whether this is a bug, or
whether I missed something here.

Thanks,
Yan
