[ 
https://issues.apache.org/jira/browse/PARQUET-370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734568#comment-14734568
 ] 

Cheng Lian edited comment on PARQUET-370 at 9/10/15 11:43 AM:
--------------------------------------------------------------

Complete sample code for reproducing this issue against parquet-mr 1.7.0 can 
be found in 
[[email protected]/parquet-compat|https://github.com/liancheng/parquet-compat/blob/cbd9dd89b015049c43054c5db81737405f6618e2/src/test/scala/com/databricks/parquet/schema/SchemaEvolutionSuite.scala#L9-L61].
  This sample writes a Parquet file with schema {{S1}} and reads it back with 
{{S2}} as the requested schema using parquet-avro.  The related Avro IDL 
definition can be found 
[here|https://github.com/liancheng/parquet-compat/blob/with-parquet-mr-1.7.0/src/main/avro/parquet-avro-compat.avdl].

BTW, this repository is a playground of mine for investigating various Parquet 
compatibility and interoperability issues.  The Scala DSL illustrated in the 
sample code is inspired by the {{writeDirect}} method in the parquet-avro test 
code, and is defined 
[here|https://github.com/liancheng/parquet-compat/blob/db39ec3437abd3c254457c39193685e9f9dee1ed/src/main/scala/com/databricks/parquet/dsl/package.scala].
  I find it pretty neat and intuitive for building test cases, and we are 
using a similar testing API in Spark.
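To make the expected semantics concrete, here is a plain-Java sketch (not parquet-mr code; the {{project}} helper and class name are hypothetical) of what projecting a stored record onto an evolved schema should do: requested fields that are absent from the file come back as explicit nulls, while the enclosing required group itself survives.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the projection semantics this issue asks for:
// reading a record written with S1 {n: {a, b}} under requested schema
// S2 {n: {c, d}} should yield n = {c: null, d: null}, not n = null,
// because the required group n still exists in the file.
public class SchemaEvolutionSketch {
    // Projects a stored nested group onto the requested field names.
    // Fields absent from the stored data come back as explicit nulls.
    public static Map<String, Object> project(Map<String, Object> stored,
                                              String... requestedFields) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String f : requestedFields) {
            out.put(f, stored.get(f)); // null when the column is missing
        }
        return out;
    }

    public static void main(String[] args) {
        // The group n as written under S1: {"a": 1, "b": 2}
        Map<String, Object> n = new HashMap<>();
        n.put("a", 1);
        n.put("b", 2);

        // Reading under S2's fields c and d: the group is kept, its new
        // optional fields are null. parquet-mr instead drops the whole
        // group because none of its leaf columns exist in the file.
        System.out.println(project(n, "c", "d")); // prints {c=null, d=null}
    }
}
```

This is only a model of the desired output, of course; the actual fix has to happen where {{MessageColumnIO}} decides to short-circuit the group with an {{EmptyRecordReader}}.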


was (Author: lian cheng):
A complete sample code for reproducing this issue against parquet-mr 1.7.0 can 
be found in 
[[email protected]/parquet-compat|https://github.com/liancheng/parquet-compat/blob/with-parquet-mr-1.7.0/src/main/scala/com/databricks/parquet/schema/PARQUET_370.scala].
  This sample writes a Parquet file with schema {{S1}} and reads it back with 
{{S2}} as requested schema using parquet-avro.  Related Avro IDL definition can 
be found 
[here|https://github.com/liancheng/parquet-compat/blob/with-parquet-mr-1.7.0/src/main/avro/parquet-avro-compat.avdl].
  The version against parquet-mr 1.8.1 is 
[here|https://github.com/liancheng/parquet-compat/blob/with-parquet-mr-1.8.1/src/main/scala/com/databricks/parquet/schema/PARQUET_370.scala].

BTW, this repository is a playground of mine for investigating various Parquet 
compatibility and interoperability issues.  The Scala DSL illustrated in the 
sample code is inspired by the {{writeDirect}} method in parquet-avro testing 
code.  It is defined 
[here|https://github.com/liancheng/parquet-compat/blob/db39ec3437abd3c254457c39193685e9f9dee1ed/src/main/scala/com/databricks/parquet/dsl/package.scala].
  I found it pretty neat and intuitive for building test cases, and we are 
using a similar testing API in Spark.

> Nested records are not properly read if none of their fields are requested
> --------------------------------------------------------------------------
>
>                 Key: PARQUET-370
>                 URL: https://issues.apache.org/jira/browse/PARQUET-370
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.1
>            Reporter: Cheng Lian
>
> Say we have a Parquet file {{F}} with the following schema {{S1}}:
> {noformat}
> message root {
>   required group n {
>     optional int32 a;
>     optional int32 b;
>   }
> }
> {noformat}
> Later on, as the schema evolves, fields {{a}} and {{b}} are removed, while 
> {{c}} and {{d}} are added. Now we have schema {{S2}}:
> {noformat}
> message root {
>   required group n {
>     optional int32 c;
>     optional int32 d;
>   }
> }
> {noformat}
> {{S1}} and {{S2}} are compatible, so it should be OK to read {{F}} with 
> {{S2}} as the requested schema.
> Say {{F}} contains a single record:
> {noformat}
> {"n": {"a": 1, "b": 2}}
> {noformat}
> When reading {{F}} with {{S2}}, the expected output should be:
> {noformat}
> {"n": {"c": null, "d": null}}
> {noformat}
> But parquet-mr currently gives:
> {noformat}
> {"n": null}
> {noformat}
> This is because {{MessageColumnIO}} finds that the physical Parquet file 
> contains no leaf columns defined in the requested schema, and short-circuits 
> record reading with an {{EmptyRecordReader}} for column {{n}}. See 
> [here|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-column/src/main/java/org/apache/parquet/io/MessageColumnIO.java#L97-L99].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
