[ 
https://issues.apache.org/jira/browse/ARROW-14890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451368#comment-17451368
 ] 

Weston Pace commented on ARROW-14890:
-------------------------------------

It looks to me like the ORC reader does support this capability: 
https://github.com/apache/orc/blob/main/c%2B%2B/include/orc/sargs/SearchArgument.hh

The reader has a "search arguments" option, which is detailed in the above 
header file.  It is similar to Arrow's expressions, so work would need to be 
done to convert an Arrow expression into an ORC "search argument".  I see 
some code for both bloom filters and min/max statistics for stripes, so I would 
presume the reader uses both under the hood.
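
To make the conversion step concrete, here is a rough, self-contained sketch of 
walking an expression tree and emitting builder-style calls.  Everything in it is 
hypothetical: Expr is a toy stand-in rather than Arrow's expression type, and 
RecordingBuilder only records calls as text.  It is loosely modeled on the 
start/end builder style in the header linked above; the real class names and 
signatures should be checked there.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for an Arrow expression (not Arrow's API).
struct Expr {
  enum class Kind { kAnd, kOr, kNot, kLess, kEqual, kUnsupported };
  Kind kind = Kind::kUnsupported;
  std::string field;           // comparison leaves only
  int64_t literal = 0;         // comparison leaves only
  std::vector<Expr> children;  // And / Or / Not only
};

Expr Leaf(Expr::Kind kind, std::string field, int64_t literal) {
  Expr e;
  e.kind = kind;
  e.field = std::move(field);
  e.literal = literal;
  return e;
}
Expr Node(Expr::Kind kind, std::vector<Expr> children) {
  Expr e;
  e.kind = kind;
  e.children = std::move(children);
  return e;
}
Expr Less(std::string f, int64_t v) { return Leaf(Expr::Kind::kLess, std::move(f), v); }
Expr Equal(std::string f, int64_t v) { return Leaf(Expr::Kind::kEqual, std::move(f), v); }
Expr And(std::vector<Expr> c) { return Node(Expr::Kind::kAnd, std::move(c)); }
Expr Or(std::vector<Expr> c) { return Node(Expr::Kind::kOr, std::move(c)); }
Expr Not(Expr c) { return Node(Expr::Kind::kNot, {std::move(c)}); }
Expr Unsupported() { return Expr(); }  // e.g. a function call with no pushdown form

// Hypothetical builder that records the calls it receives as text.
class RecordingBuilder {
 public:
  void StartAnd() { Open("and"); }
  void StartOr() { Open("or"); }
  void StartNot() { Open("not"); }
  void End() { out_ += ")"; }
  void LessThan(const std::string& column, int64_t value) { AddLeaf("lt", column, value); }
  void Equals(const std::string& column, int64_t value) { AddLeaf("eq", column, value); }
  const std::string& str() const { return out_; }

 private:
  // Put a space between siblings, but not right after an opening parenthesis.
  void Separate() {
    if (!out_.empty() && out_.back() != '(') out_ += " ";
  }
  void Open(const char* name) {
    Separate();
    out_ += name;
    out_ += "(";
  }
  void AddLeaf(const char* op, const std::string& column, int64_t value) {
    Separate();
    out_ += std::string(op) + "(" + column + ", " + std::to_string(value) + ")";
  }
  std::string out_;
};

// True only if every node in the tree has a builder equivalent.
bool IsSupported(const Expr& e) {
  switch (e.kind) {
    case Expr::Kind::kLess:
    case Expr::Kind::kEqual:
      return true;
    case Expr::Kind::kNot:
      return e.children.size() == 1 && IsSupported(e.children[0]);
    case Expr::Kind::kAnd:
    case Expr::Kind::kOr:
      if (e.children.empty()) return false;
      for (const Expr& c : e.children) {
        if (!IsSupported(c)) return false;
      }
      return true;
    case Expr::Kind::kUnsupported:
      return false;
  }
  return false;
}

// Depth-first walk; assumes IsSupported(e) has already returned true.
void Emit(const Expr& e, RecordingBuilder* b) {
  switch (e.kind) {
    case Expr::Kind::kLess: b->LessThan(e.field, e.literal); return;
    case Expr::Kind::kEqual: b->Equals(e.field, e.literal); return;
    case Expr::Kind::kAnd: b->StartAnd(); break;
    case Expr::Kind::kOr: b->StartOr(); break;
    case Expr::Kind::kNot: b->StartNot(); break;
    case Expr::Kind::kUnsupported: return;
  }
  for (const Expr& c : e.children) Emit(c, b);
  b->End();
}

// Returns false and leaves *out untouched when nothing can be pushed down.
bool Translate(const Expr& e, std::string* out) {
  if (!IsSupported(e)) return false;
  RecordingBuilder b;
  Emit(e, &b);
  *out = b.str();
  return true;
}
```

This sketch rejects the whole expression as soon as one node is unsupported.  A 
real conversion could probably be less strict for AND and push down only the 
supported children, assuming the full filter is still applied after the scan.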

I'm not sure if we could use this information to attach a guarantee to the 
resulting batches, but that could be a follow-up enhancement.

We should ideally test for tricky cases like "the expression references fields 
not included in this ORC file" or "the expression references fields that don't 
have statistics" (I'm not sure whether statistics are optional in ORC).
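
As a sketch of how those cases might be handled conservatively, the toy code 
below decides whether a stripe can be skipped for a predicate of the form 
"column < value".  The types are hypothetical and are not ORC's or Arrow's; in 
particular, representing a missing field as an absent map entry and missing 
statistics as a has_stats flag is an assumption made only for illustration.  
Whenever the metadata cannot prove that no row matches, the stripe is read.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical per-column statistics for one stripe.
struct ColumnStats {
  bool has_stats;  // false when the writer recorded no min/max
  int64_t min;     // only meaningful when has_stats is true
  int64_t max;     // only meaningful when has_stats is true
};

// Hypothetical stripe metadata: a column absent from the map is not in the file.
using StripeStats = std::map<std::string, ColumnStats>;

enum class Decision { kSkip, kMustRead };

// Decide whether a stripe can be skipped for the predicate `column < value`.
Decision CheckLessThan(const StripeStats& stats, const std::string& column,
                       int64_t value) {
  auto it = stats.find(column);
  // Tricky case 1: the field is not in this file, so nothing can be proven.
  if (it == stats.end()) return Decision::kMustRead;
  // Tricky case 2: the field exists but has no statistics.
  if (!it->second.has_stats) return Decision::kMustRead;
  // If the smallest value is already >= value, no row satisfies column < value.
  return it->second.min >= value ? Decision::kSkip : Decision::kMustRead;
}
```

Tests for a real implementation would presumably cover the same three branches: 
a stripe that can be skipped, a field missing from the file, and a field without 
statistics.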

> [C++][Dataset] Add support for filter pushdown in the ORC Scanner
> -----------------------------------------------------------------
>
>                 Key: ARROW-14890
>                 URL: https://issues.apache.org/jira/browse/ARROW-14890
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: C++
>            Reporter: xiangxiang Shen
>            Priority: Major
>              Labels: dataset, orc
>
> In Arrow datasets, filter pushdown can greatly improve file-reading 
> performance. We notice that Parquet already implements it: 
> https://github.com/apache/arrow/blob/35b3567e73423420a99dbe6116f000e3c77d2a4c/cpp/src/arrow/dataset/file_parquet.cc#L465-L484.
> The ORC file format does not support filter pushdown yet; it currently 
> ignores the "filter" of ScanOptions.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
