[
https://issues.apache.org/jira/browse/ARROW-17959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617603#comment-17617603
]
Joris Van den Bossche commented on ARROW-17959:
-----------------------------------------------
[~lidavidm] do you remember how you were planning to implement this? (cfr
https://github.com/apache/arrow/pull/11704#discussion_r749733765, but I can't
see what you did before removing it after the discussion in that comment)
At the time you mentioned that casting structs was needed. From a quick
look, it seems to me that if I naively change {{ResolveOneFieldRef}} in
file_parquet.cc to not add the top-level indices (so that Parquet actually
reads only the required leaves), the main problem is that we "Bind" the
FieldRef expression to the dataset schema, and at that point the FieldRef
with names gets converted into a FieldRef backed by integer indices (a
FieldPath). But this path points into the original full struct type, and not
into the "reduced" struct that the Parquet reader now actually returns. This
then results in errors when actually executing the FieldRef expression with
the "struct_field" kernel.
Or should we keep the names in the FieldRef (cfr ARROW-17989), and only bind
the type to the FieldRef expression? Then evaluating the FieldRef expression
should work regardless of whether the file reader in question returned the full
struct or a subsetted version.
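To make the difference between the two approaches concrete, here is a minimal toy sketch. It does not use the Arrow API: {{ToyStruct}}, {{BindToIndex}}, {{GetByIndex}} and {{GetByName}} are hypothetical names that only model child order and names, on the assumption that an index-backed path behaves like a plain positional lookup.

```cpp
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a struct value: ordered (child name, value) pairs.
// This is NOT an Arrow type; it only models child order and names.
using ToyStruct = std::vector<std::pair<std::string, int>>;

// Early binding: resolve a child name to its position in a given struct.
std::optional<int> BindToIndex(const ToyStruct& s, const std::string& name) {
  for (int i = 0; i < static_cast<int>(s.size()); ++i) {
    if (s[i].first == name) return i;
  }
  return std::nullopt;
}

// Evaluate a pre-bound index; returns nullopt if the index is out of range.
std::optional<int> GetByIndex(const ToyStruct& s, int index) {
  if (index < 0 || index >= static_cast<int>(s.size())) return std::nullopt;
  return s[index].second;
}

// Late binding: resolve the name against the struct that was actually returned.
std::optional<int> GetByName(const ToyStruct& s, const std::string& name) {
  if (auto index = BindToIndex(s, name)) return GetByIndex(s, *index);
  return std::nullopt;
}
```

In this toy model, take a full struct with children a, b, c and a reader that returns only c. The index bound against the full struct (2) is out of range for the reduced struct, so {{GetByIndex}} fails, while {{GetByName}} finds "c" in both the full and the reduced struct.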
Although I suppose you had something else in mind, given the reference to
ARROW-1888.
> [C++][Dataset] Optimize Parquet column projection for subset of nested field
> ----------------------------------------------------------------------------
>
> Key: ARROW-17959
> URL: https://issues.apache.org/jira/browse/ARROW-17959
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Priority: Major
> Labels: dataset
>
> Currently, when reading a subfield of a nested column of a Parquet file using
> the Dataset API, we read the full parent column instead of only the requested
> field. This should be optimized to only read the field itself.
> This was left as a TODO in ARROW-14658
> (https://github.com/apache/arrow/pull/11704) which added the initial support
> for nested field refs in dataset scanning
> (https://github.com/apache/arrow/blob/c29ca51f44eaf41c3a2f6f72e3e23a7b428211c2/cpp/src/arrow/dataset/file_parquet.cc#L240-L246):
> {code}
> if (field) {
> // TODO(ARROW-1888): support fine-grained column projection. We should be
> // able to materialize only the child fields requested, and not the entire
> // top-level field.
> // Right now, if enabled, projection/filtering will fail when they cast the
> // physical schema to the dataset schema.
> AddColumnIndices(*toplevel, columns_selection);
> {code}
> Some relevant comments at
> https://github.com/apache/arrow/pull/11704#discussion_r749733765. ARROW-1888
> was mentioned as a blocker back then, but it has been resolved in the meantime.