[
https://issues.apache.org/jira/browse/PARQUET-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15942271#comment-15942271
]
Itai Incze commented on PARQUET-918:
------------------------------------
Actually, one of my current goals is to use partial schema reads to work around
parts of the schema that are not yet supported...
I'm in the process of fixing the problem and would like to contribute the fix
when it's finished, if possible - I've opened this issue partly to figure out
how to make it meet the project's standards.
Right now I have two tests in arrow-reader-writer-test.cc, in a new suite, that
use {{ReadTable}} to read a simple parquet file with a nested schema. One reads
the entire file, the other reads a subset of the columns.
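Roughly, the two reads look something like the following (a sketch against the
parquet-cpp arrow API as of cpp-1.0.0; the file name and column indices are
placeholders, not the actual test fixture):

```cpp
#include <arrow/io/file.h>
#include <arrow/table.h>
#include <parquet/arrow/reader.h>

void ReadNestedFile() {
  // "nested.parquet" is a placeholder for the test's nested-schema file.
  std::shared_ptr<arrow::io::ReadableFile> infile;
  arrow::io::ReadableFile::Open("nested.parquet", &infile);

  std::unique_ptr<parquet::arrow::FileReader> reader;
  parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader);

  // First test: read the whole table.
  std::shared_ptr<arrow::Table> whole;
  reader->ReadTable(&whole);

  // Second test: read only some leaf columns, by column index.
  std::shared_ptr<arrow::Table> partial;
  reader->ReadTable({0, 1}, &partial);
}
```

(Status returns are ignored here for brevity; the real tests check them.)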
Also, I'm working on making a fix based on approach (1), but I have to say that
this doesn't feel like the most elegant way.
> FromParquetSchema API crashes on nested schemas
> -----------------------------------------------
>
> Key: PARQUET-918
> URL: https://issues.apache.org/jira/browse/PARQUET-918
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Affects Versions: cpp-1.0.0
> Reporter: Itai Incze
>
> {{FromParquetSchema@src/parquet/arrow/schema.cc:276}} misbehaves by using the
> {{column_indices}} parameter of its second overload as indices into the
> schema's direct root fields.
> This is problematic with nested schema parquet files - the bug crashes the
> process by accessing the fields vector out of bounds.
> This bug is masked by another bug in the first overload of
> {{FromParquetSchema}}, which constructs a complete indices list the size of
> the number of schema fields (instead of the number of columns).
> The bug is triggered in many common use cases, for example when using
> the {{arrow::ReadTable}} API.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)