Cheng Lian created PARQUET-370:
----------------------------------
Summary: Nested records are not properly read if none of their
fields are requested
Key: PARQUET-370
URL: https://issues.apache.org/jira/browse/PARQUET-370
Project: Parquet
Issue Type: Bug
Components: parquet-mr
Affects Versions: 1.7.0, 1.6.0, 1.5.0, 1.8.1
Reporter: Cheng Lian
Say we have a Parquet file {{F}} with the following schema {{S1}}:
{noformat}
message root {
  required group n {
    optional int32 a;
    optional int32 b;
  }
}
{noformat}
Later on, as the schema evolves, fields {{a}} and {{b}} are removed, while
{{c}} and {{d}} are added. Now we have schema {{S2}}:
{noformat}
message root {
  required group n {
    optional int32 c;
    optional int32 d;
  }
}
{noformat}
{{S1}} and {{S2}} are compatible, so it should be OK to read {{F}} with {{S2}}
as the requested schema.
Say {{F}} contains a single record:
{noformat}
{"n": {"a": 1, "b": 2}}
{noformat}
When reading {{F}} with {{S2}}, the expected output is:
{noformat}
{"n": {"c": null, "d": null}}
{noformat}
But currently parquet-mr gives:
{noformat}
{"n": null}
{noformat}
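To make the expected semantics concrete, here is a minimal sketch in plain Python. It is not parquet-mr code: the nested-dict schema representation and the {{project}} helper are hypothetical, introduced only to illustrate that a group present in the record should survive projection even when none of its requested fields exist in the file.

```python
# Hypothetical model (not the parquet-mr API): a schema is a nested dict in
# which a group maps to a dict of its fields and a leaf maps to its type name.
S2 = {"n": {"c": "int32", "d": "int32"}}


def project(record, schema):
    """Project a record onto the requested schema.

    Requested leaves that are absent from the record become None (null).
    A group that is present in the record stays present, even if none of
    its requested fields exist in the record.
    """
    out = {}
    for name, field in schema.items():
        value = record.get(name)
        if isinstance(field, dict):
            # Group: keep it unless the group itself is missing from the record.
            out[name] = None if value is None else project(value, field)
        else:
            # Leaf: copy the value if present, otherwise null.
            out[name] = value
    return out


record = {"n": {"a": 1, "b": 2}}  # the single record stored in F
print(project(record, S2))  # {'n': {'c': None, 'd': None}}
```

Under this model, {{n}} only becomes null when the group itself is missing from the record, which is not the case for {{F}}.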
This is because {{MessageColumnIO}} finds that the physical Parquet file
contains none of the leaf columns defined in the requested schema, and
short-circuits record reading with an {{EmptyRecordReader}} for column {{n}}. See
[here|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-column/src/main/java/org/apache/parquet/io/MessageColumnIO.java#L97-L99].
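The condition that triggers the shortcut can be sketched as follows. This is a plain-Python model under stated assumptions, not the actual {{MessageColumnIO}} logic: the nested-dict schema representation and the {{leaf_paths}} helper are hypothetical. It only shows that {{S1}} and {{S2}} share no leaf column paths, which is the situation described above.

```python
# Hypothetical model (not the parquet-mr API): a schema is a nested dict in
# which a group maps to a dict of its fields and a leaf maps to its type name.
S1 = {"n": {"a": "int32", "b": "int32"}}  # schema the file F was written with
S2 = {"n": {"c": "int32", "d": "int32"}}  # requested schema


def leaf_paths(schema, prefix=()):
    """Return the set of leaf column paths in a schema, e.g. ('n', 'a')."""
    paths = set()
    for name, field in schema.items():
        path = prefix + (name,)
        if isinstance(field, dict):
            paths |= leaf_paths(field, path)  # descend into the group
        else:
            paths.add(path)  # primitive leaf column
    return paths


# Leaf columns that are both requested and physically present in the file.
readable = leaf_paths(S1) & leaf_paths(S2)
print(sorted(readable))  # []
```

An empty intersection means no column data needs to be read, but it does not mean the record is empty: the required group {{n}} still exists and should be materialized with null fields.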