[ https://issues.apache.org/jira/browse/PARQUET-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated PARQUET-651:
-------------------------------
    Description: 
Found this issue while investigating SPARK-16344.

For the following Parquet schema

{noformat}
message root {
  optional group f (LIST) {
    repeated group list {
      optional group element {
        optional int64 element;
      }
    }
  }
}
{noformat}

parquet-avro decodes it as something like this:

{noformat}
record SingleElement {
  int element;
}

record NestedSingleElement {
  SingleElement element;
}

record Spark16344Wrong {
  array<NestedSingleElement> f;
}
{noformat}

while the correct interpretation should be:

{noformat}
record SingleElement {
  int element;
}

record Spark16344 {
  array<SingleElement> f;
}
{noformat}
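
The mis-decoding can also be observed outside the test suite by reading such a file with {{AvroParquetReader}}. The sketch below is illustrative only (the file path is a placeholder and the class name is made up for this example); the commented output follows from the two interpretations above rather than from a captured run:

{code:java}
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

// Illustrative sketch only: read a file written with the schema above and
// print the record that parquet-avro produces for it.
public class ReadSingleElementList {
  public static void main(String[] args) throws Exception {
    Path path = new Path("/tmp/single-element-list.parquet");  // placeholder
    try (ParquetReader<GenericRecord> reader =
        AvroParquetReader.<GenericRecord>builder(path).build()) {
      GenericRecord record = reader.read();
      // Correct interpretation:  {"f": [{"element": 42}]}
      // Current (wrong) result:  {"f": [{"element": {"element": 42}}]}
      System.out.println(record);
    }
  }
}
{code}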

The following test case, added to {{TestArrayCompatibility}}, reproduces this issue:

{code:java}
@Test
public void testSpark16344() throws Exception {
  Path test = writeDirect(
      "message root {" +
          "  optional group f (LIST) {" +
          "    repeated group list {" +
          "      optional group element {" +
          "        optional int32 element;" +
          "      }" +
          "    }" +
          "  }" +
          "}",
      new DirectWriter() {
        @Override
        public void write(RecordConsumer rc) {
          rc.startMessage();
          rc.startField("f", 0);

          rc.startGroup();
          rc.startField("list", 0);

          rc.startGroup();
          rc.startField("element", 0);

          rc.startGroup();
          rc.startField("element", 0);

          rc.addInteger(42);

          rc.endField("element", 0);
          rc.endGroup();

          rc.endField("element", 0);
          rc.endGroup();

          rc.endField("list", 0);
          rc.endGroup();

          rc.endField("f", 0);
          rc.endMessage();
        }

      });

  Schema element = record("rec", field("element", primitive(Schema.Type.INT)));
  Schema expectedSchema = record("root", field("f", array(element)));

  GenericRecord expectedRecord =
      instance(expectedSchema, "f",
          Collections.singletonList(instance(element, 42)));

  assertReaderContains(newBehaviorReader(test), expectedSchema, expectedRecord);
}
{code}

The reason is that the syntactic {{element}} group in the standard LIST layout

{noformat}
<list-repetition> group <name> (LIST) {
  repeated group list {
    <element-repetition> <element-type> element;
  }
}
{noformat}

is recognized as a record field ({{NestedSingleElement.element}} in the wrong interpretation above) rather than as the list element type. The problematic code lies in [{{AvroRecordConverter.isElementType()}}|https://github.com/apache/parquet-mr/blob/bd0b5af025fab9cad8f94260138741c252f45fc8/parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java#L858]. We should probably check the standard 3-level layout first before falling back to the legacy 2-level layout.
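
For illustration, here is one possible shape for such a check, written as a purely structural test with hypothetical names; it is a sketch of the idea under that ordering, not the actual {{AvroRecordConverter}} code (which also consults the corresponding Avro schema):

{code:java}
import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.Type;

// Hypothetical sketch, not the actual parquet-mr implementation: decide
// whether the repeated node inside a LIST-annotated group is the element
// itself (legacy 2-level layout) or only the syntactic wrapper (standard
// 3-level layout), checking the standard layout first.
final class ListLayoutHeuristic {
  static boolean repeatedGroupIsElement(Type repeatedType) {
    if (repeatedType.isPrimitive()) {
      // Legacy 2-level layout such as "repeated int32 element;": the
      // repeated node itself is the element.
      return true;
    }
    GroupType group = repeatedType.asGroupType();
    // Standard 3-level layout: a single-field group named "list" whose only
    // field is "element" is the syntactic wrapper, never the element, even
    // if that "element" field happens to be a single-field group itself.
    if (group.getFieldCount() == 1
        && "list".equals(group.getName())
        && "element".equals(group.getFieldName(0))) {
      return false;
    }
    // Only then fall back to the legacy 2-level heuristics from the spec's
    // backward-compatibility rules.
    return group.getFieldCount() > 1
        || "array".equals(group.getName())
        || group.getName().endsWith("_tuple");
  }
}
{code}

With the standard layout checked first, the single-field {{element}} group in the schema above would be decoded as the list element type instead of producing the extra wrapper record.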


  was:
Found this issue while investigating SPARK-16344.

For the following Parquet schema

{noformat}
message root {
  optional group f (LIST) {
    repeated group list {
      optional group element {
        optional int64 element;
      }
    }
  }
}
{noformat}

parquet-avro decodes it as something like this:

{noformat}
record SingleElement {
  int element;
}

record NestedSingleElement {
  SingleElement element;
}

record Spark16344Wrong {
  array<NestedSingleElement> f;
}
{noformat}

while correct interpretation should be:

{noformat}
record SingleElement {
  int element;
}

record Spark16344 {
  array<SingleElement> f;
}
{noformat}

Adding the following test case to {{TestArrayCompatibility}} may reproduce this 
issue:

{code:java}
@Test
public void testSpark16344() throws Exception {
  Path test = writeDirect(
      "message root {" +
          "  optional group f (LIST) {" +
          "    repeated group list {" +
          "      optional group element {" +
          "        optional int32 element;" +
          "      }" +
          "    }" +
          "  }" +
          "}",
      new DirectWriter() {
        @Override
        public void write(RecordConsumer rc) {
          rc.startMessage();
          rc.startField("f", 0);

          rc.startGroup();
          rc.startField("list", 0);

          rc.startGroup();
          rc.startField("element", 0);

          rc.startGroup();
          rc.startField("element", 0);

          rc.addInteger(42);

          rc.endField("element", 0);
          rc.endGroup();

          rc.endField("element", 0);
          rc.endGroup();

          rc.endField("list", 0);
          rc.endGroup();

          rc.endField("f", 0);
          rc.endMessage();
        }

      });

  Schema element = record("?", field("element", primitive(Schema.Type.INT)));
  Schema expectedSchema = record("root", field("f", array(element)));

  GenericRecord expectedRecord =
      instance(expectedSchema, "f", Collections.singletonList(instance(element, 
42)));

  assertReaderContains(newBehaviorReader(test), expectedSchema, expectedRecord);
}
{code}

The reason is that the {{element}} syntactic group for LIST in

{noformat}
<list-repetition> group <name> (LIST) {
  repeated group list {
    <element-repetition> <element-type> element;
  }
}
{noformat}

is recognized as record field {{SingleElement.element}}. The problematic code 
lies in 
[{{AvroRecordConverter.isElementType()}}|https://github.com/apache/parquet-mr/blob/bd0b5af025fab9cad8f94260138741c252f45fc8/parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java#L858].
 We should probably check the standard 3-level layout first before falling back 
to the legacy 2-level layout.



> Parquet-avro fails to decode array of record with a single field name 
> "element" correctly
> -----------------------------------------------------------------------------------------
>
>                 Key: PARQUET-651
>                 URL: https://issues.apache.org/jira/browse/PARQUET-651
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro
>    Affects Versions: 1.7.0, 1.8.0, 1.8.1
>            Reporter: Cheng Lian
>
> Found this issue while investigating SPARK-16344.
> For the following Parquet schema
> {noformat}
> message root {
>   optional group f (LIST) {
>     repeated group list {
>       optional group element {
>         optional int64 element;
>       }
>     }
>   }
> }
> {noformat}
> parquet-avro decodes it as something like this:
> {noformat}
> record SingleElement {
>   int element;
> }
> record NestedSingleElement {
>   SingleElement element;
> }
> record Spark16344Wrong {
>   array<NestedSingleElement> f;
> }
> {noformat}
> while correct interpretation should be:
> {noformat}
> record SingleElement {
>   int element;
> }
> record Spark16344 {
>   array<SingleElement> f;
> }
> {noformat}
> Adding the following test case to {{TestArrayCompatibility}} may reproduce 
> this issue:
> {code:java}
> @Test
> public void testSpark16344() throws Exception {
>   Path test = writeDirect(
>       "message root {" +
>           "  optional group f (LIST) {" +
>           "    repeated group list {" +
>           "      optional group element {" +
>           "        optional int32 element;" +
>           "      }" +
>           "    }" +
>           "  }" +
>           "}",
>       new DirectWriter() {
>         @Override
>         public void write(RecordConsumer rc) {
>           rc.startMessage();
>           rc.startField("f", 0);
>           rc.startGroup();
>           rc.startField("list", 0);
>           rc.startGroup();
>           rc.startField("element", 0);
>           rc.startGroup();
>           rc.startField("element", 0);
>           rc.addInteger(42);
>           rc.endField("element", 0);
>           rc.endGroup();
>           rc.endField("element", 0);
>           rc.endGroup();
>           rc.endField("list", 0);
>           rc.endGroup();
>           rc.endField("f", 0);
>           rc.endMessage();
>         }
>       });
>   Schema element = record("rec", field("element", 
> primitive(Schema.Type.INT)));
>   Schema expectedSchema = record("root", field("f", array(element)));
>   GenericRecord expectedRecord =
>       instance(expectedSchema, "f", 
> Collections.singletonList(instance(element, 42)));
>   assertReaderContains(newBehaviorReader(test), expectedSchema, 
> expectedRecord);
> }
> {code}
> The reason is that the {{element}} syntactic group for LIST in
> {noformat}
> <list-repetition> group <name> (LIST) {
>   repeated group list {
>     <element-repetition> <element-type> element;
>   }
> }
> {noformat}
> is recognized as record field {{SingleElement.element}}. The problematic code 
> lies in 
> [{{AvroRecordConverter.isElementType()}}|https://github.com/apache/parquet-mr/blob/bd0b5af025fab9cad8f94260138741c252f45fc8/parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java#L858].
>  We should probably check the standard 3-level layout first before falling 
> back to the legacy 2-level layout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
