[jira] [Commented] (ARROW-17959) [C++][Dataset] Optimize Parquet column projection for subset of nested field

Joris Van den Bossche (Jira) Fri, 14 Oct 2022 02:39:04 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-17959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617603#comment-17617603
 ]


Joris Van den Bossche commented on ARROW-17959:
-----------------------------------------------

[~lidavidm] do you remember how you were planning to implement this? (cfr 
https://github.com/apache/arrow/pull/11704#discussion_r749733765, but I can't 
see what you did before removing it after the discussion in that comment) 

At the time you mentioned that casting structs was needed. From a quick 
look, it seems to me that if I naively change file_parquet.cc 
({{ResolveOneFieldRef}}) to not add the top-level indices (so that Parquet 
actually only reads the required leaves), the main problem is that we "Bind" 
the FieldRef expression to the dataset schema, and at that point the FieldRef 
with names gets converted into a FieldRef backed by integer indices / 
FieldPath. But this path points into the original full struct type, and not 
into the "reduced" struct that the Parquet reader now actually returns, which 
then results in errors when executing this FieldRef expression with the 
"struct_field" kernel.
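
To make the index mismatch concrete, here is a minimal sketch in plain C++17. It is a toy model and does not use the Arrow API: {{ToyStruct}}, {{BindToIndex}}, {{ResolveIndex}} and {{IndexBindingBreaksOnReducedStruct}} are hypothetical names invented for illustration, and a struct type is reduced to an ordered list of child names.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Toy stand-in for a struct type: an ordered list of child field names.
// (Hypothetical model for illustration; this is not the Arrow C++ API.)
using ToyStruct = std::vector<std::string>;

// "Bind" a child name to its positional index in the given struct layout.
std::optional<std::size_t> BindToIndex(const ToyStruct& layout,
                                       const std::string& name) {
  for (std::size_t i = 0; i < layout.size(); ++i) {
    if (layout[i] == name) return i;
  }
  return std::nullopt;
}

// Resolve a previously bound index against the struct the reader returned.
// An index past the end models the error seen at execution time.
std::optional<std::string> ResolveIndex(const ToyStruct& actual,
                                        std::size_t index) {
  if (index >= actual.size()) return std::nullopt;
  return actual[index];
}

// Returns true when an index bound against the full struct no longer
// resolves against a reduced struct holding only the requested child.
bool IndexBindingBreaksOnReducedStruct() {
  const ToyStruct full = {"a", "b", "c"};  // struct in the dataset schema
  const ToyStruct reduced = {"c"};         // only the requested leaf was read
  const std::optional<std::size_t> bound = BindToIndex(full, "c");
  if (!bound) return false;
  return ResolveIndex(full, *bound).has_value() &&
         !ResolveIndex(reduced, *bound).has_value();
}
```

In this model the index bound for "c" is only meaningful for the layout it was bound against, which mirrors the situation described above.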

Or should we keep the names in the FieldRef (cfr ARROW-17989), and only bind 
the type to the FieldRef expression? Then evaluating the FieldRef expression 
should work regardless of whether the file reader in question returned the full 
struct or a subsetted version.  
Although I suppose you had something else in mind, given the reference to 
ARROW-1888.
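
As a rough illustration of the name-keeping alternative, the following plain C++17 sketch binds only the type and keeps the name, then resolves by name against whatever struct the reader produced. Again this is a toy model under my own assumptions, not the Arrow API: {{ToyField}}, {{ToyStruct}}, {{NameBoundRef}}, {{BindKeepingName}} and {{ResolveByName}} are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Toy stand-ins (hypothetical; not the Arrow C++ API).
struct ToyField {
  std::string name;
  std::string type;
};
using ToyStruct = std::vector<ToyField>;

// A reference that keeps the child name and only records the bound type.
struct NameBoundRef {
  std::string name;
  std::string type;
};

// Bind against the dataset schema: look up the type, keep the name.
std::optional<NameBoundRef> BindKeepingName(const ToyStruct& dataset_struct,
                                            const std::string& name) {
  for (const ToyField& field : dataset_struct) {
    if (field.name == name) return NameBoundRef{field.name, field.type};
  }
  return std::nullopt;
}

// Resolve by name (and check the type) against the struct actually returned,
// yielding the child's position in that struct.
std::optional<std::size_t> ResolveByName(const ToyStruct& actual,
                                         const NameBoundRef& ref) {
  for (std::size_t i = 0; i < actual.size(); ++i) {
    if (actual[i].name == ref.name && actual[i].type == ref.type) return i;
  }
  return std::nullopt;
}
```

With a full struct of children a, b, c and a reduced struct holding only c, a reference bound for "c" resolves to position 2 in the former and position 0 in the latter, so evaluation does not depend on which of the two the reader returned.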


> [C++][Dataset] Optimize Parquet column projection for subset of nested field
> ----------------------------------------------------------------------------
>
>                 Key: ARROW-17959
>                 URL: https://issues.apache.org/jira/browse/ARROW-17959
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset
>
> Currently, when reading a subfield of a nested column of a Parquet file using 
> the Dataset API, we read the full parent column instead of only the requested 
> field. This should be optimized to only read the field itself.
> This was left as a TODO in ARROW-14658 
> (https://github.com/apache/arrow/pull/11704) which added the initial support 
> for nested field refs in dataset scanning 
> (https://github.com/apache/arrow/blob/c29ca51f44eaf41c3a2f6f72e3e23a7b428211c2/cpp/src/arrow/dataset/file_parquet.cc#L240-L246):
> {code}
>   if (field) {
>     // TODO(ARROW-1888): support fine-grained column projection. We should be
>     // able to materialize only the child fields requested, and not the entire
>     // top-level field.
>     // Right now, if enabled, projection/filtering will fail when they cast the
>     // physical schema to the dataset schema.
>     AddColumnIndices(*toplevel, columns_selection);
> {code}
> Some relevant comments are at 
> https://github.com/apache/arrow/pull/11704#discussion_r749733765. ARROW-1888 
> was mentioned as a blocker back then, but it has been resolved in the meantime.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
