[ https://issues.apache.org/jira/browse/PARQUET-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cheng Lian updated PARQUET-651:
-------------------------------
Description:
Found this issue while investigating SPARK-16344.
For the following Parquet schema
{noformat}
message root {
  optional group f (LIST) {
    repeated group list {
      optional group element {
        optional int64 element;
      }
    }
  }
}
{noformat}
parquet-avro decodes it as something like this:
{noformat}
record SingleElement {
  long element;
}
record NestedSingleElement {
  SingleElement element;
}
record Spark16344Wrong {
  array<NestedSingleElement> f;
}
{noformat}
while the correct interpretation should be:
{noformat}
record SingleElement {
  long element;
}
record Spark16344 {
  array<SingleElement> f;
}
{noformat}
The following test case, added to {{TestArrayCompatibility}}, reproduces the issue:
{code:java}
@Test
public void testSpark16344() throws Exception {
  Path test = writeDirect(
      "message root {" +
      "  optional group f (LIST) {" +
      "    repeated group list {" +
      "      optional group element {" +
      "        optional int32 element;" +
      "      }" +
      "    }" +
      "  }" +
      "}",
      new DirectWriter() {
        @Override
        public void write(RecordConsumer rc) {
          // Write a single row: f = [ { element: 42 } ]
          rc.startMessage();
          rc.startField("f", 0);
          rc.startGroup();
          rc.startField("list", 0);
          rc.startGroup();
          rc.startField("element", 0);
          rc.startGroup();
          rc.startField("element", 0);
          rc.addInteger(42);
          rc.endField("element", 0);
          rc.endGroup();
          rc.endField("element", 0);
          rc.endGroup();
          rc.endField("list", 0);
          rc.endGroup();
          rc.endField("f", 0);
          rc.endMessage();
        }
      });

  // Expected (correct) interpretation: f is an array of single-field records.
  Schema element = record("rec", field("element", primitive(Schema.Type.INT)));
  Schema expectedSchema = record("root", field("f", array(element)));
  GenericRecord expectedRecord = instance(expectedSchema, "f",
      Collections.singletonList(instance(element, "element", 42)));

  assertReaderContains(newBehaviorReader(test), expectedSchema, expectedRecord);
}
{code}
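Outside the test harness, the same misinterpretation can be observed by reading such a file with {{AvroParquetReader}} directly. The snippet below is only a minimal sketch (the class name, {{main}} wrapper, and file path are illustrative, not part of the proposed test):
{code:java}
import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

public class Spark16344Repro {
  public static void main(String[] args) throws IOException {
    // Path to a Parquet file written with the schema from the description.
    Path path = new Path(args[0]);
    try (ParquetReader<GenericRecord> reader =
        AvroParquetReader.<GenericRecord>builder(path).build()) {
      GenericRecord record;
      while ((record = reader.read()) != null) {
        // With the bug, each entry of "f" is a record wrapping a single-field
        // "element" record instead of the element record itself.
        System.out.println(record);
      }
    }
  }
}
{code}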
The root cause is that the {{element}} entry of the standard 3-level LIST layout,
{noformat}
<list-repetition> group <name> (LIST) {
  repeated group list {
    <element-repetition> <element-type> element;
  }
}
{noformat}
is treated as an ordinary record field of the repeated {{list}} group, which is in turn taken to be the array element type, instead of being recognized as part of the standard 3-level layout. The heuristic is fooled here because the actual element record also has a single field named {{element}}. The problematic code is in
[{{AvroRecordConverter.isElementType()}}|https://github.com/apache/parquet-mr/blob/bd0b5af025fab9cad8f94260138741c252f45fc8/parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java#L858].
We should probably check for the standard 3-level layout first and only fall back to the legacy 2-level layout when the wrapper shape does not match.
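For reference, the intended ordering could look roughly like the sketch below. This is only an illustration of the idea, with made-up class and method names rather than a patch against {{AvroRecordConverter}}; a real fix would also have to honour the Avro read schema and the older backward-compatibility wrapper names ({{array}}, {{*_tuple}}):
{code:java}
import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.Type;

// Illustrative sketch of "3-level first, 2-level fallback"; not the actual
// parquet-mr code.
final class ListLayoutSketch {

  private ListLayoutSketch() {
  }

  // Returns true when the repeated type inside a LIST-annotated group is the
  // array element itself (legacy 2-level layout), and false when it is the
  // synthetic wrapper group of the standard 3-level layout.
  static boolean repeatedTypeIsElement(Type repeatedType) {
    if (repeatedType.isPrimitive()) {
      // A repeated primitive can only be the element itself.
      return true;
    }
    GroupType repeated = repeatedType.asGroupType();
    boolean standardWrapper =
        "list".equals(repeated.getName())
            && repeated.getFieldCount() == 1
            && "element".equals(repeated.getFieldName(0));
    if (standardWrapper) {
      // Standard 3-level layout: "repeated group list { ... element; }" is the
      // wrapper, not the element, even when the real element record happens to
      // have a single field named "element".
      return false;
    }
    // Everything else falls back to the legacy 2-level interpretation.
    return true;
  }
}
{code}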
> Parquet-avro fails to decode array of record with a single field named
> "element" correctly
> -----------------------------------------------------------------------------------------
>
> Key: PARQUET-651
> URL: https://issues.apache.org/jira/browse/PARQUET-651
> Project: Parquet
> Issue Type: Bug
> Components: parquet-avro
> Affects Versions: 1.7.0, 1.8.0, 1.8.1
> Reporter: Cheng Lian
>