Hi Cheng,
The projection schemas in the image are below. The left-hand schema contains the
extra "ts" field directly under Nest2 (the block marked with asterisks in the
image); the right-hand schema omits it.

Left-hand schema:

{
  "type": "record",
  "name": "Profile",
  "namespace": "avro.parquet.model",
  "fields": [
    {
      "name": "nest2",
      "type": [
        {
          "type": "array",
          "items": {
            "type": "record",
            "name": "Nest2",
            "fields": [
              {
                "name": "ts",
                "type": "int"
              },
              {
                "name": "nest1",
                "type": {
                  "type": "record",
                  "name": "Nest1",
                  "fields": [
                    {
                      "name": "mpi",
                      "type": "int"
                    },
                    {
                      "name": "ts",
                      "type": "int"
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  ]
}

Right-hand schema (identical except that the extra "ts" field under Nest2 is
removed):

{
  "type": "record",
  "name": "Profile",
  "namespace": "avro.parquet.model",
  "fields": [
    {
      "name": "nest2",
      "type": [
        {
          "type": "array",
          "items": {
            "type": "record",
            "name": "Nest2",
            "fields": [
              {
                "name": "nest1",
                "type": {
                  "type": "record",
                  "name": "Nest1",
                  "fields": [
                    {"name": "mpi", "type": "int"},
                    {"name": "ts", "type": "int"}
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  ]
}
I just realized that I made a mistake in my previous experiment. I then reran
the reading job with the modified schema on the right-hand side (see above),
and it succeeded.
I am wondering how this can work.
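Since Avro schemas are plain JSON, the difference between the two projections can be checked mechanically. A minimal sketch (the left-hand schema is transcribed from the image, so exact field order is an approximation; the right-hand one is derived by dropping the extra field):

```python
import copy
import json

# Left-hand projection schema, transcribed from the image: the extra "ts"
# field sits directly under Nest2, before nest1.
LEFT = """
{
  "type": "record", "name": "Profile", "namespace": "avro.parquet.model",
  "fields": [{
    "name": "nest2",
    "type": [{
      "type": "array",
      "items": {
        "type": "record", "name": "Nest2",
        "fields": [
          {"name": "ts", "type": "int"},
          {"name": "nest1", "type": {
            "type": "record", "name": "Nest1",
            "fields": [{"name": "mpi", "type": "int"},
                       {"name": "ts", "type": "int"}]}}
        ]
      }
    }]
  }]
}
"""

left = json.loads(LEFT)

# The right-hand projection is the same schema minus that extra "ts" field.
right = copy.deepcopy(left)
nest2 = right["fields"][0]["type"][0]["items"]  # union branch 0 -> array items
nest2["fields"] = [f for f in nest2["fields"] if f["name"] != "ts"]

def nest2_field_names(schema):
    # Profile -> fields[0] (nest2) -> union branch 0 (array) -> items (Nest2)
    return [f["name"] for f in schema["fields"][0]["type"][0]["items"]["fields"]]

print(nest2_field_names(left))   # ['ts', 'nest1']
print(nest2_field_names(right))  # ['nest1']
```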
Thanks,
Yan
On Mon, Sep 21, 2015 at 1:02 PM, Cheng Lian <[email protected]> wrote:
> Also, could you please provide the full exception stack trace?
>
> Cheng
>
>
> On 9/21/15 1:01 PM, Cheng Lian wrote:
>
>> Hm... unfortunately the inline image fails to show up :(
>>
>> On 9/21/15 12:06 PM, Yan Qi wrote:
>>
>>> Hi Cheng,
>>>
>>> I am now using parquet-mr 1.8.0, though I doubt it alone could solve the
>>> issue. I investigated further by trying the following query (see the
>>> image below):
>>>
>>> Inline image 1
>>>
>>> I added one more attribute in the projection schema, directly under
>>> Nest2. Surprisingly, the org.apache.parquet.io.InvalidRecordException
>>> is gone.
>>>
>>> However, there is another exception, as
>>>
>>> org.apache.parquet.io.ParquetDecodingException: Can not read value at 0
>>> in block -1 in file file:/path/to/part-00011-avro-parquet-100
>>> at
>>> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
>>> at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:125)
>>> at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:129)
>>> ...
>>> Caused by: org.apache.parquet.io.ParquetDecodingException: The
>>> requested schema is not compatible with the file schema. incompatible
>>> types: required group nest2 (LIST) {
>>> ...
>>>
>>> I am a little bit confused by the result. Could you help me understand this
>>> a bit?
>>>
>>> Thanks,
>>> Yan
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Sep 21, 2015 at 10:53 AM, Cheng Lian <[email protected]
>>> <mailto:[email protected]>> wrote:
>>>
>>> Sorry, missed link of my 2nd test case:
>>>
>>> https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/test/scala/com/databricks/parquet/schema/ListCompatibilitySuite.scala#L137-L155
>>>
>>> Cheng
>>>
>>> On 9/21/15 10:52 AM, Cheng Lian wrote:
>>>
>>> I only checked the schema part while writing my last reply.
>>> However, after some more testing, I found that actually
>>> parquet-avro 1.7.0 can correctly decode your data, and
>>> "isElementType" does return true because of this "if" branch
>>> <
>>> https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-avro/src/main/java/org/apache/parquet/avro/AvroIndexedRecordConverter.java#L518-L522
>>> >.
>>> May I know the version of parquet-avro/parquet-mr you were
>>> working with?
>>>
>>> Here are my tests (in Scala, with parquet-mr 1.7.0):
>>>
>>> - Case 1
>>> <
>>> https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/test/scala/com/databricks/parquet/schema/ListCompatibilitySuite.scala#L103-L135
>>> >:
>>>     Reading/writing a raw Parquet file containing an Avro array of
>>>     single-element records.
>>> - Case 2: The Profile example you provided. Avro definitions
>>> can be found here
>>> <
>>> https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/main/avro/parquet-avro-compat.avdl#L24-L35
>>> >.
>>>
>>>
>>>
>>> Is it possible that the Profile example schema you provided
>>> doesn't fully demonstrate the nature of your problematic data?
>>>
>>> Cheng
>>>
>>> On 9/21/15 10:04 AM, Yan Qi wrote:
>>>
>>> Hi Cheng,
>>>
>>> Thanks a lot for your help!
>>>
>>> A quick question: do we need to regenerate the data to
>>> reflect this change?
>>>
>>> Thanks,
>>> Yan
>>>
>>> On Sat, Sep 19, 2015 at 12:55 AM, Cheng
>>> Lian<[email protected]> wrote:
>>>
>>> To be more specific, the Parquet schema converted
>>> from the Avro schema is:
>>>
>>> message avro.parquet.model.Profile {
>>> required group nest2 (LIST) {
>>> repeated group array {
>>> required group nest1 {
>>> required int32 mpi;
>>> required int32 ts;
>>> }
>>> }
>>> }
>>> }
>>>
>>> The second level "array" field should be viewed as
>>> element type according
>>> to the following LIST backwards-compatibility rule:
>>>
>>>         If the repeated field is a group with one field and is named
>>>         either "array" or uses the LIST-annotated group's name with
>>>         "_tuple" appended, then the repeated type is the element type
>>>         and elements are required.
>>>
>>>         However, this rule is handled only in
>>>         AvroSchemaConverter, not in
>>>         AvroIndexedRecordConverter. Thus, "array" is
>>>         interpreted as the element
>>>         wrapper group, and the element type of nest2 is
>>>         considered to be
>>>
>>> required group nest1 {
>>> required int32 mpi;
>>> required int32 ts;
>>> }
>>>
>>> rather than
>>>
>>> required group Nest2 {
>>> required group nest1 {
>>> required int32 mpi;
>>> required int32 ts;
>>> }
>>> }
>>>
>>> This PR should probably fix your issue (although I
>>> still have some
>>> questions about the testing code)
>>> https://github.com/apache/parquet-mr/pull/264
>>>
>>>         BTW, the recently released Spark SQL 1.5 should be able
>>>         to read such Parquet
>>> files, but you need some extra work if you want to
>>> convert Spark SQL row
>>> objects to Avro records. This PR fixed this issue for
>>> Spark SQL:
>>> https://github.com/apache/spark/pull/8361
>>>
>>> Cheng
>>>
>>>
>>> On 9/19/15 12:40 AM, Cheng Lian wrote:
>>>
>>> You are hitting PARQUET-364
>>> https://issues.apache.org/jira/browse/PARQUET-364
>>>
>>> Although the symptoms are different, they share
>>> the same root cause.
>>>
>>> Cheng
>>>
>>> On 9/18/15 6:13 PM, Yan Qi wrote:
>>>
>>> Hi,
>>>
>>> The data I am working on have the following
>>> schema:
>>>
>>> {
>>> "namespace": "avro.parquet.model",
>>> "type": "record",
>>> "name": "Profile",
>>> "fields": [
>>> {"name": "nest1", "type": [{ "type": "array",
>>> "items": "Nest1"},
>>> "null"]},
>>> {"name": "nest2", "type": [{ "type": "array",
>>> "items": "Nest2"},
>>> "null"]},
>>>             {"name": "idd", "type": "long"}
>>>             ]
>>>             }
>>>
>>> {
>>> "namespace": "avro.parquet.model",
>>> "type": "record",
>>> "name": "Nest1",
>>> "fields": [
>>> {"name": "ts", "type": "int"},
>>> {"name": "mpi", "type": "int"},
>>> {"name": "api1", "type": ["int", "null"],
>>> "default": -1},
>>> {"name": "api2", "type": ["long", "null"],
>>> "default": -1},
>>>             {"name": "api3", "type": ["int", "null"],
>>>             "default": -1}
>>>             ]
>>>             }
>>>
>>> {
>>> "namespace": "avro.parquet.model",
>>> "type": "record",
>>> "name": "Nest2",
>>> "fields": [
>>> {"name": "nest1", "type": "Nest1"},
>>> {"name": "ts", "type": "int"} ]
>>> }
>>>
>>>             Basically there are two nested tables, nest1 and nest2,
>>>             in the Profile table. Nest2 has a member of type Nest1.
>>>             However, when I try to access the data with the following
>>>             projection schema:
>>>
>>> {
>>> "type": "record",
>>> "name": "Profile",
>>> "namespace": "avro.parquet.model",
>>> "fields": [
>>> {
>>> "name": "nest2",
>>> "type": [
>>> {
>>> "type": "array",
>>> "items": {
>>> "type": "record",
>>> "name": "Nest2",
>>> "fields": [
>>> {
>>> "name": "nest1",
>>> "type": {
>>> "type": "record",
>>> "name": "Nest1",
>>> "fields": [
>>> {
>>> "name": "mpi",
>>> "type": "int"
>>> },
>>> {
>>> "name": "ts",
>>> "type": "int"
>>> }
>>> ]
>>> }
>>> }
>>> ]
>>> }
>>> }
>>> ]
>>> }
>>> ]
>>> }
>>>
>>>             But I get the following error message:
>>>
>>>             org.apache.parquet.io.InvalidRecordException:
>>>             Parquet/Avro schema mismatch.
>>> Avro field 'mpi' not found.
>>> at
>>>
>>> org.apache.parquet.avro.AvroIndexedRecordConverter.getAvroField(AvroIndexedRecordConverter.java:133)
>>>
>>> ....
>>>
>>>
>>> When checking the code, I found that the
>>> function,
>>>
>>> org.apache.parquet.avro.AvroIndexedRecordConverter.AvroArrayConverter.isElementType(Type,
>>>
>>> Schema) returns false, though these two
>>> parameters (i.e., type and
>>> schema)
>>>             are compatible. I am wondering whether it is a
>>>             bug or whether I missed something
>>>             here.
>>>
>>> Thanks,
>>> Yan
>>>
>>>
>>>
>>>
>>>
>>>
>>
>