[
https://issues.apache.org/jira/browse/ARROW-17959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617677#comment-17617677
]
David Li commented on ARROW-17959:
----------------------------------
I'm not sure anymore; it's been too long. struct_field came later, so I would
say that we shouldn't apply struct_field at all (or should apply it only when
needed). Weston's "V2" scanner has facilities for this: there is an explicit
interface for adapting the file-level schema to the dataset-level schema,
where we can apply any needed transformations.
I still have the old branch with the original changes (including nested
columns in Parquet). If that's useful or interesting, I can push it up
somewhere.
> [C++][Dataset] Optimize Parquet column projection for subset of nested field
> ----------------------------------------------------------------------------
>
> Key: ARROW-17959
> URL: https://issues.apache.org/jira/browse/ARROW-17959
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Priority: Major
> Labels: dataset
>
> Currently, when reading a subfield of a nested column of a Parquet file using
> the Dataset API, we read the full parent column instead of only the requested
> field. This should be optimized to only read the field itself.
> This was left as a TODO in ARROW-14658
> (https://github.com/apache/arrow/pull/11704) which added the initial support
> for nested field refs in dataset scanning
> (https://github.com/apache/arrow/blob/c29ca51f44eaf41c3a2f6f72e3e23a7b428211c2/cpp/src/arrow/dataset/file_parquet.cc#L240-L246):
> {code}
> if (field) {
>   // TODO(ARROW-1888): support fine-grained column projection. We should be
>   // able to materialize only the child fields requested, and not the entire
>   // top-level field.
>   // Right now, if enabled, projection/filtering will fail when they cast the
>   // physical schema to the dataset schema.
>   AddColumnIndices(*toplevel, columns_selection);
> {code}
> Some relevant comments are at
> https://github.com/apache/arrow/pull/11704#discussion_r749733765. ARROW-1888
> was mentioned as a blocker at the time, but it has since been resolved.
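
To make the requested optimization concrete, here is a minimal, self-contained
sketch of fine-grained column selection. It is not Arrow code: Node, Leaf,
Struct, AssignLeafIndices, CollectLeaves, LeavesForPath and MakeExampleSchema
are all hypothetical names invented for illustration. It assumes only that
each leaf (primitive) field of a nested schema is stored as its own column and
that leaves are numbered in depth-first order. Given a field path, it returns
just the leaf columns under that path rather than every leaf under the
top-level parent.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a schema node; not an Arrow or Parquet type.
struct Node {
  std::string name;
  std::vector<Node> children;  // empty for a leaf (primitive) field
  int leaf_index = -1;         // column position in the file; set for leaves only
};

Node Leaf(std::string name) {
  Node n;
  n.name = std::move(name);
  return n;
}

Node Struct(std::string name, std::vector<Node> children) {
  Node n;
  n.name = std::move(name);
  n.children = std::move(children);
  return n;
}

// Number the leaves depth-first, starting at `next`; returns the next free index.
int AssignLeafIndices(Node& node, int next) {
  if (node.children.empty()) {
    node.leaf_index = next;
    return next + 1;
  }
  for (Node& child : node.children) next = AssignLeafIndices(child, next);
  return next;
}

// Append the index of every leaf at or below `node`.
void CollectLeaves(const Node& node, std::vector<int>* out) {
  if (node.children.empty()) {
    assert(node.leaf_index >= 0);  // indices must have been assigned first
    out->push_back(node.leaf_index);
    return;
  }
  for (const Node& child : node.children) CollectLeaves(child, out);
}

// Resolve a path such as {"a", "y"} and return only the leaf columns under it.
// Returns an empty vector if the path does not exist in the schema.
std::vector<int> LeavesForPath(const Node& root,
                               const std::vector<std::string>& path) {
  const Node* current = &root;
  for (const std::string& name : path) {
    const Node* next = nullptr;
    for (const Node& child : current->children) {
      if (child.name == name) {
        next = &child;
        break;
      }
    }
    if (next == nullptr) return {};
    current = next;
  }
  std::vector<int> out;
  CollectLeaves(*current, &out);
  return out;
}

// Example schema: a: struct<x, y: struct<p, q>>, b
Node MakeExampleSchema() {
  Node root = Struct(
      "", {Struct("a", {Leaf("x"), Struct("y", {Leaf("p"), Leaf("q")})}),
           Leaf("b")});
  AssignLeafIndices(root, 0);
  return root;
}
```

With this example schema, selecting the path {"a", "y"} yields only the two
leaf columns under a.y, whereas selecting the top-level field {"a"} yields all
three leaves under a. The latter corresponds to what the quoted TODO describes
as the current behaviour.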