Hi Cheng,
I am now using parquet-mr 1.8.0, though I doubt it will solve the
issue. I investigated further by trying the following query:
[inline image of the query, not included]
I added one more attribute to the projection schema, directly under
Nest2. Surprisingly, the org.apache.parquet.io.InvalidRecordException
is gone. However, there is now a different exception:
org.apache.parquet.io.ParquetDecodingException: Can not read value at
0 in block -1 in file file:/path/to/part-00011-avro-parquet-100
at
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:125)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:129)
...
Caused by: org.apache.parquet.io.ParquetDecodingException: The
requested schema is not compatible with the file schema. incompatible
types: required group nest2 (LIST) {
...
I am a little bit confused by this result. Could you help me
understand it?
Thanks,
Yan
On Mon, Sep 21, 2015 at 10:53 AM, Cheng Lian <[email protected]> wrote:
Sorry, I missed the link to my second test case:
https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/test/scala/com/databricks/parquet/schema/ListCompatibilitySuite.scala#L137-L155
Cheng
On 9/21/15 10:52 AM, Cheng Lian wrote:
I only checked the schema part while writing my last reply.
However, after some more testing, I found that actually
parquet-avro 1.7.0 can correctly decode your data, and
"isElementType" does return true because of this "if" branch
<https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-avro/src/main/java/org/apache/parquet/avro/AvroIndexedRecordConverter.java#L518-L522>.
May I know the version of parquet-avro/parquet-mr you were
working with?
Here are my tests (in Scala, with parquet-mr 1.7.0):
- Case 1
<https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/test/scala/com/databricks/parquet/schema/ListCompatibilitySuite.scala#L103-L135>:
Reading/writing a raw Parquet file containing Avro array of
single-element records.
- Case 2: The Profile example you provided. Avro definitions
can be found here
<https://github.com/liancheng/parquet-compat/blob/e3f1705024e131b3284b2a12b0ab077c934ad956/src/main/avro/parquet-avro-compat.avdl#L24-L35>.
Is it possible that the Profile example schema you provided
doesn't fully demonstrate the nature of your problematic data?
Cheng
On 9/21/15 10:04 AM, Yan Qi wrote:
Hi Cheng,
Thanks a lot for your help!
A quick question: do we need to regenerate the data to
reflect this change?
Thanks,
Yan
On Sat, Sep 19, 2015 at 12:55 AM, Cheng Lian <[email protected]> wrote:
To be more specific, the Parquet schema converted from
the Avro schema is:
message avro.parquet.model.Profile {
required group nest2 (LIST) {
repeated group array {
required group nest1 {
required int32 mpi;
required int32 ts;
}
}
}
}
The second-level "array" field should be viewed as the element type
according to the following LIST backwards-compatibility rule:

    If the repeated field is a group with one field and is named
    either "array" or uses the LIST-annotated group's name with
    "_tuple" appended, then the repeated type is the element type
    and elements are required.
However, this rule is only handled in AvroSchemaConverter, not in
AvroIndexedRecordConverter. Thus, "array" is interpreted as an
element wrapper group, and the element type of nest2 is considered
to be
required group nest1 {
required int32 mpi;
required int32 ts;
}
rather than
required group Nest2 {
required group nest1 {
required int32 mpi;
required int32 ts;
}
}
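The naming part of that rule can be sketched as a small predicate. This is a simplified illustration, not the actual parquet-mr code, which inspects real Parquet Type objects rather than just names and field counts:

```java
// Simplified sketch of the LIST backwards-compatibility naming rule
// quoted above; the real check in parquet-mr also inspects the actual
// Parquet Type objects, not just names and field counts.
public class ListCompatSketch {
    // Returns true when the repeated group itself is the element type,
    // i.e. it should NOT be treated as a synthetic element wrapper.
    static boolean repeatedGroupIsElementType(String repeatedGroupName,
                                              int repeatedFieldCount,
                                              String listGroupName) {
        // A repeated group with more than one field is always the element type.
        if (repeatedFieldCount != 1) {
            return true;
        }
        // A one-field group named "array", or named after the LIST group
        // with "_tuple" appended, is also the element type under the
        // backwards-compatibility rule.
        return repeatedGroupName.equals("array")
            || repeatedGroupName.equals(listGroupName + "_tuple");
    }

    public static void main(String[] args) {
        // In the schema above: repeated group "array" with one field
        // ("nest1") inside the LIST group "nest2" -> element type.
        System.out.println(repeatedGroupIsElementType("array", 1, "nest2"));
        // prints "true"
    }
}
```

Under this rule the one-field group "array" in your file is the element type itself, which is why treating it as a wrapper produces the nest1-only element type shown above.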
This PR should probably fix your issue (although I still have some
questions about the testing code):
https://github.com/apache/parquet-mr/pull/264
BTW, the recently released Spark SQL 1.5 should be able to read such
Parquet files, but you need some extra work if you want to convert
Spark SQL row objects to Avro records. This PR fixed the issue for
Spark SQL:
https://github.com/apache/spark/pull/8361
Cheng
On 9/19/15 12:40 AM, Cheng Lian wrote:
You are hitting PARQUET-364
https://issues.apache.org/jira/browse/PARQUET-364
Although the symptoms are different, they share
the same root cause.
Cheng
On 9/18/15 6:13 PM, Yan Qi wrote:
Hi,
The data I am working with has the following schemas:
{
  "namespace": "avro.parquet.model",
  "type": "record",
  "name": "Profile",
  "fields": [
    {"name": "nest1", "type": [{"type": "array", "items": "Nest1"}, "null"]},
    {"name": "nest2", "type": [{"type": "array", "items": "Nest2"}, "null"]},
    {"name": "idd", "type": "long"}
  ]
}
{
  "namespace": "avro.parquet.model",
  "type": "record",
  "name": "Nest1",
  "fields": [
    {"name": "ts", "type": "int"},
    {"name": "mpi", "type": "int"},
    {"name": "api1", "type": ["int", "null"], "default": -1},
    {"name": "api2", "type": ["long", "null"], "default": -1},
    {"name": "api3", "type": ["int", "null"], "default": -1}
  ]
}
{
  "namespace": "avro.parquet.model",
  "type": "record",
  "name": "Nest2",
  "fields": [
    {"name": "nest1", "type": "Nest1"},
    {"name": "ts", "type": "int"}
  ]
}
Basically, there are two nested tables, nest1 and nest2, in the
table Profile, and Nest2 has a member of type Nest1. However, when I
try to access the data with the following projection schema:
{
"type": "record",
"name": "Profile",
"namespace": "avro.parquet.model",
"fields": [
{
"name": "nest2",
"type": [
{
"type": "array",
"items": {
"type": "record",
"name": "Nest2",
"fields": [
{
"name": "nest1",
"type": {
"type": "record",
"name": "Nest1",
"fields": [
{
"name": "mpi",
"type": "int"
},
{
"name": "ts",
"type": "int"
}
]
}
}
]
}
}
]
}
]
}
I get the following error:
org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema
mismatch. Avro field 'mpi' not found.
at
org.apache.parquet.avro.AvroIndexedRecordConverter.getAvroField(AvroIndexedRecordConverter.java:133)
....
When checking the code, I found that the method
org.apache.parquet.avro.AvroIndexedRecordConverter.AvroArrayConverter.isElementType(Type,
Schema) returns false, even though the two parameters (i.e., the
type and the schema) are compatible. I am wondering whether this is
a bug or whether I am missing something here.
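For context, I apply the projection roughly like this. This is a sketch assuming the parquet-avro 1.7/1.8 builder API; the file path and the way `projectionSchema` is loaded are placeholders, and it requires parquet-avro and Hadoop on the classpath:

```java
// Sketch: supplying a requested projection schema to parquet-avro.
// Assumes parquet-avro/Hadoop on the classpath; paths are placeholders.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class ProjectionReadSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder: the projection schema shown above, parsed from a file.
    Schema projectionSchema = new Schema.Parser().parse(
        ProjectionReadSketch.class.getResourceAsStream("/projection.avsc"));

    Configuration conf = new Configuration();
    // Ask parquet-avro to decode only the projected columns.
    AvroReadSupport.setRequestedProjection(conf, projectionSchema);

    try (ParquetReader<GenericRecord> reader =
             AvroParquetReader.<GenericRecord>builder(
                     new Path("/path/to/part-00011"))
                 .withConf(conf)
                 .build()) {
      GenericRecord record;
      while ((record = reader.read()) != null) {
        System.out.println(record);
      }
    }
  }
}
```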
Thanks,
Yan