[jira] [Commented] (ARROW-17959) [C++][Dataset] Optimize Parquet column projection for subset of nested field

David Li (Jira) Fri, 14 Oct 2022 05:10:39 -0700


    [ https://issues.apache.org/jira/browse/ARROW-17959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617677#comment-17617677 ]


David Li commented on ARROW-17959:
----------------------------------

I'm not sure anymore; it's been too long. struct_field came later, so I would 
say that we shouldn't apply struct_field at all (or should only apply it when 
needed). Weston's "V2" scanner has facilities for this: there is an explicit 
interface for adapting the file-level schema to the dataset-level schema, 
where we can apply any needed transformations.

I still have the old branch with the original changes (including nested 
columns in Parquet). If that's useful or interesting, I can push it up somewhere.

> [C++][Dataset] Optimize Parquet column projection for subset of nested field
> ----------------------------------------------------------------------------
>
>                 Key: ARROW-17959
>                 URL: https://issues.apache.org/jira/browse/ARROW-17959
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset
>
> Currently, when reading a subfield of a nested column of a Parquet file using 
> the Dataset API, we read the full parent column instead of only the requested 
> field. This should be optimized to only read the field itself.
> This was left as a TODO in ARROW-14658 
> (https://github.com/apache/arrow/pull/11704) which added the initial support 
> for nested field refs in dataset scanning 
> (https://github.com/apache/arrow/blob/c29ca51f44eaf41c3a2f6f72e3e23a7b428211c2/cpp/src/arrow/dataset/file_parquet.cc#L240-L246):
> {code}
>   if (field) {
>     // TODO(ARROW-1888): support fine-grained column projection. We should be
>     // able to materialize only the child fields requested, and not the entire
>     // top-level field.
>     // Right now, if enabled, projection/filtering will fail when they cast the
>     // physical schema to the dataset schema.
>     AddColumnIndices(*toplevel, columns_selection);
> {code}
> Some relevant comments are at 
> https://github.com/apache/arrow/pull/11704#discussion_r749733765. ARROW-1888 
> was mentioned as a blocker back then, but it has since been resolved.
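
To illustrate the difference between selecting the whole top-level field and 
selecting only the requested child, here is a minimal, self-contained sketch. 
It does not use Arrow or Parquet types: Node, CountLeaves, AddAllLeaves, 
LeafIndicesForPath and ExampleSchema are hypothetical names invented for this 
example, and it assumes leaf columns are numbered depth-first in schema order.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified schema node (NOT an Arrow or Parquet class).
// A node without children stands for one leaf column in the file.
struct Node {
  std::string name;
  std::vector<Node> children;
};

// Number of leaf columns stored under `node`.
int CountLeaves(const Node& node) {
  if (node.children.empty()) return 1;
  int n = 0;
  for (const Node& child : node.children) n += CountLeaves(child);
  return n;
}

// Append the indices of every leaf under `node`, given the index of its
// first leaf. This is what selecting an entire field amounts to.
void AddAllLeaves(const Node& node, int first_leaf, std::vector<int>* out) {
  const int n = CountLeaves(node);
  for (int i = 0; i < n; ++i) out->push_back(first_leaf + i);
}

// Resolve a nested path such as {"point", "y"} and return only the leaf
// indices under the referenced node. Returns an empty vector if the path
// does not resolve.
std::vector<int> LeafIndicesForPath(const std::vector<Node>& fields,
                                    const std::vector<std::string>& path) {
  std::vector<int> out;
  const std::vector<Node>* level = &fields;
  int offset = 0;  // index of the first leaf of the candidate being examined
  for (std::size_t depth = 0; depth < path.size(); ++depth) {
    const Node* match = nullptr;
    for (const Node& candidate : *level) {
      if (candidate.name == path[depth]) {
        match = &candidate;
        break;
      }
      offset += CountLeaves(candidate);  // skip leaves of earlier siblings
    }
    if (match == nullptr) return {};
    if (depth + 1 == path.size()) {
      AddAllLeaves(*match, offset, &out);
      return out;
    }
    level = &match->children;
  }
  return out;
}

// Example schema: id, point{x, y, meta{tag, ts}}, name
// Leaf indices:   0         1  2       3    4     5
std::vector<Node> ExampleSchema() {
  return {
      Node{"id", {}},
      Node{"point",
           {Node{"x", {}}, Node{"y", {}},
            Node{"meta", {Node{"tag", {}}, Node{"ts", {}}}}}},
      Node{"name", {}},
  };
}
```

With this example schema, resolving only the top-level name {"point"} selects 
leaves 1 through 4, which corresponds to the current behaviour of reading the 
full parent column. Resolving the full path {"point", "y"} selects only leaf 
2, which is the kind of narrowing this issue asks for.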



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
