Cheng Lian created PARQUET-370:
----------------------------------

             Summary: Nested records are not properly read if none of their fields are requested
                 Key: PARQUET-370
                 URL: https://issues.apache.org/jira/browse/PARQUET-370
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
    Affects Versions: 1.7.0, 1.6.0, 1.5.0, 1.8.1
            Reporter: Cheng Lian


Say we have a Parquet file {{F}} with the following schema {{S1}}:
{noformat}
message root {
  required group n {
    optional int32 a;
    optional int32 b;
  }
}
{noformat}
Later on, as the schema evolves, fields {{a}} and {{b}} are removed, while 
{{c}} and {{d}} are added. Now we have schema {{S2}}:
{noformat}
message root {
  required group n {
    optional int32 c;
    optional int32 d;
  }
}
{noformat}
{{S1}} and {{S2}} are compatible, so it should be OK to read {{F}} with {{S2}} 
as the requested schema.

Say {{F}} contains a single record:
{noformat}
{"n": {"a": 1, "b": 2}}
{noformat}
When reading {{F}} with {{S2}}, the expected output is:
{noformat}
{"n": {"c": null, "d": null}}
{noformat}
But currently parquet-mr gives:
{noformat}
{"n": null}
{noformat}
This is because {{MessageColumnIO}} finds that the physical Parquet file 
contains none of the leaf columns defined in the requested schema, and 
short-circuits record reading with an {{EmptyRecordReader}} for column {{n}}. See 
[here|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-column/src/main/java/org/apache/parquet/io/MessageColumnIO.java#L97-L99].
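
To make the difference between the two outputs concrete, here is a small toy model in plain Python. It is only a sketch of the behaviour described above, not parquet-mr code: the dict-based schema representation and the function names {{leaf_paths}}, {{project}} and {{project_shortcut}} are all hypothetical, and real Parquet files track group presence through definition levels rather than nested dicts.

```python
# Toy model (assumption): a schema is a nested dict. A group maps field names
# to sub-schemas, and a primitive leaf is represented by the string "int32".
S1 = {"n": {"a": "int32", "b": "int32"}}  # schema the file was written with
S2 = {"n": {"c": "int32", "d": "int32"}}  # requested schema after evolution


def leaf_paths(schema, prefix=()):
    """Return the set of leaf column paths in a schema, e.g. ("n", "a")."""
    paths = set()
    for name, sub in schema.items():
        if isinstance(sub, dict):
            paths |= leaf_paths(sub, prefix + (name,))
        else:
            paths.add(prefix + (name,))
    return paths


def project(record, requested):
    """Expected behaviour: a group present in the record stays present,
    and requested leaves that the file does not contain become None."""
    out = {}
    for name, sub in requested.items():
        value = record.get(name)
        if isinstance(sub, dict):
            out[name] = project(value, sub) if value is not None else None
        else:
            out[name] = value
    return out


def project_shortcut(record, file_schema, requested):
    """Models the reported shortcut: if the file shares no leaf column with
    the requested schema, nothing is read and every top-level field is None."""
    if not (leaf_paths(file_schema) & leaf_paths(requested)):
        return {name: None for name in requested}
    return project(record, requested)


record = {"n": {"a": 1, "b": 2}}
print(project(record, S2))               # {'n': {'c': None, 'd': None}}
print(project_shortcut(record, S1, S2))  # {'n': None}
```

In this model the shortcut loses the fact that {{n}} itself is present in the record, which is exactly the information the expected output preserves.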



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
