[
https://issues.apache.org/jira/browse/ARROW-14890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451368#comment-17451368
]
Weston Pace commented on ARROW-14890:
-------------------------------------
It looks to me that the ORC reader does support this capability:
https://github.com/apache/orc/blob/main/c%2B%2B/include/orc/sargs/SearchArgument.hh
The reader has a "search arguments" option, which is detailed in the above
header file. This is similar to Arrow's expressions, so work would need to be
done to convert an Arrow expression into an ORC "search argument". I see
some code for both bloom filters and min/max statistics for stripes, so I would
presume the reader uses both under the hood.
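To make the conversion step concrete, here is a minimal, self-contained sketch. Everything in it is hypothetical: `Expr` is a toy stand-in for an Arrow expression, and the string returned by `ToSearchArgument` only mimics the shape of a search argument. It does not use the real Arrow or ORC APIs. The point is the recursive walk and the all-or-nothing fallback when a node cannot be translated.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for an Arrow expression tree.
struct Expr {
  enum class Kind { kAnd, kOr, kLess, kEqual, kUnsupported };
  Kind kind;
  std::string field;           // used by comparison nodes
  int64_t literal;             // used by comparison nodes
  std::vector<Expr> children;  // used by kAnd / kOr
};

// Small helpers for building the toy expression tree.
Expr Less(std::string f, int64_t v) {
  return Expr{Expr::Kind::kLess, std::move(f), v, {}};
}
Expr Equal(std::string f, int64_t v) {
  return Expr{Expr::Kind::kEqual, std::move(f), v, {}};
}
Expr Unsupported() { return Expr{Expr::Kind::kUnsupported, "", 0, {}}; }
Expr And(std::vector<Expr> c) {
  return Expr{Expr::Kind::kAnd, "", 0, std::move(c)};
}
Expr Or(std::vector<Expr> c) {
  return Expr{Expr::Kind::kOr, "", 0, std::move(c)};
}

// Translate the toy expression into a textual search argument.
// Returns std::nullopt if any node has no equivalent, in which case the
// caller would skip pushdown and filter the rows after reading them.
std::optional<std::string> ToSearchArgument(const Expr& e) {
  switch (e.kind) {
    case Expr::Kind::kLess:
      return "(< " + e.field + " " + std::to_string(e.literal) + ")";
    case Expr::Kind::kEqual:
      return "(= " + e.field + " " + std::to_string(e.literal) + ")";
    case Expr::Kind::kAnd:
    case Expr::Kind::kOr: {
      assert(!e.children.empty());  // a conjunction needs at least one child
      std::string out = (e.kind == Expr::Kind::kAnd) ? "(and" : "(or";
      for (const Expr& child : e.children) {
        std::optional<std::string> sub = ToSearchArgument(child);
        if (!sub) return std::nullopt;  // give up on the whole tree
        out += " " + *sub;
      }
      return out + ")";
    }
    case Expr::Kind::kUnsupported:
      return std::nullopt;
  }
  return std::nullopt;
}
```

For example, `ToSearchArgument(And({Less("x", 10), Equal("y", 3)}))` produces a value, while any tree containing an `Unsupported()` node produces `std::nullopt`. A real implementation could be less strict for AND nodes, since dropping an untranslatable conjunct only weakens the pushed-down predicate, but that refinement is left out here.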
I'm not sure if we could use this information to attach a guarantee to the
resulting batches but that could be a follow-up enhancement.
Ideally, we should test tricky cases like "the expression references fields
not included in this ORC file" or "the expression references fields that don't
have statistics" (I'm not sure whether statistics are optional in ORC).
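As a rough illustration of why those cases matter, the sketch below shows min/max pruning for a single predicate of the form `field < bound`. The types `ColumnStats` and `StripeInfo` and the function `EvaluateLessThan` are hypothetical and are not taken from ORC or Arrow. The assumption is that whenever the field is absent or has no statistics, the safe answer is to read the stripe and let a later filter step decide.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical min/max statistics for one column within one stripe.
struct ColumnStats {
  int64_t min;
  int64_t max;
};

// Hypothetical stripe metadata keyed by column name. A column may be
// missing from the map (not in the file) or mapped to std::nullopt
// (in the file, but without statistics).
struct StripeInfo {
  std::map<std::string, std::optional<ColumnStats>> columns;
};

enum class Verdict { kSkip, kMustRead };

// Decide whether a stripe can be skipped for the predicate `field < bound`.
Verdict EvaluateLessThan(const StripeInfo& stripe, const std::string& field,
                         int64_t bound) {
  auto it = stripe.columns.find(field);
  if (it == stripe.columns.end()) {
    return Verdict::kMustRead;  // field not in this file: cannot prune
  }
  if (!it->second.has_value()) {
    return Verdict::kMustRead;  // no statistics: cannot prune
  }
  const ColumnStats& stats = *it->second;
  assert(stats.min <= stats.max);
  // Every value is >= min, so if min >= bound no row satisfies field < bound.
  return stats.min >= bound ? Verdict::kSkip : Verdict::kMustRead;
}
```

With statistics `{min: 10, max: 20}` for `x`, the predicate `x < 10` lets the stripe be skipped, while `x < 11` does not. A predicate on a column with no statistics, or on a column absent from the file, always yields `kMustRead`.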
> [C++][Dataset] Add support for filter pushdown in the ORC Scanner
> -----------------------------------------------------------------
>
> Key: ARROW-14890
> URL: https://issues.apache.org/jira/browse/ARROW-14890
> Project: Apache Arrow
> Issue Type: Sub-task
> Components: C++
> Reporter: xiangxiang Shen
> Priority: Major
> Labels: dataset, orc
>
> In Arrow Dataset, filter pushdown can greatly improve file-reading
> performance. We noticed that Parquet already implements it:
> https://github.com/apache/arrow/blob/35b3567e73423420a99dbe6116f000e3c77d2a4c/cpp/src/arrow/dataset/file_parquet.cc#L465-L484.
> The ORC file format does not support filter pushdown yet; it currently
> ignores the "filter" of ScanOptions.