[ 
https://issues.apache.org/jira/browse/ORC-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029238#comment-17029238
 ] 

Panagiotis Garefalakis commented on ORC-577:
--------------------------------------------

To support row-level filtering as part of the ORC Reader, this patch/PR adds a 
new Reader.Options method:
{code:java}
Options setFilter(String columnName, Consumer<VectorizedRowBatch> filter)
{code}
The idea is to use a generic Consumer callback that can implement any kind of 
filtering logic, completely independent of the rest of the row-reading logic in 
ORC. As a result, we cut down on the code dependencies between ORC and the 
frameworks that consume it.
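
To illustrate the decoupling, here is a minimal self-contained sketch. It does 
not use the ORC classes: Batch and Options are hypothetical stand-ins that only 
mirror the fields and the setter discussed in this comment, and keepEven is an 
arbitrary example rule. The point is that the reader side only stores the 
callback and calls accept() on it.

```java
import java.util.function.Consumer;

final class FilterCallbackSketch {
    /** Hypothetical stand-in for VectorizedRowBatch with a single long column. */
    static final class Batch {
        final long[] values;     // stand-in for LongColumnVector.vector
        final int[] selected;    // indices of the rows that passed the filter
        boolean selectedInUse;   // true once a filter has filled `selected`
        int size;                // number of rows, or of selected rows after filtering

        Batch(long... values) {
            this.values = values;
            this.selected = new int[values.length];
            this.size = values.length;
        }
    }

    /** Hypothetical stand-in for Reader.Options; it only stores the callback. */
    static final class Options {
        String filterColumn;
        Consumer<Batch> filter;

        Options setFilter(String columnName, Consumer<Batch> filter) {
            this.filterColumn = columnName;
            this.filter = filter;
            return this;
        }
    }

    /** Caller-side rule the reader never needs to know about: keep even values. */
    static void keepEven(Batch batch) {
        int newSize = 0;
        for (int row = 0; row < batch.size; ++row) {
            if (batch.values[row] % 2 == 0) {
                batch.selected[newSize++] = row;
            }
        }
        batch.selectedInUse = true;
        batch.size = newSize;
    }

    public static void main(String[] args) {
        Options options = new Options().setFilter("x", FilterCallbackSketch::keepEven);
        Batch batch = new Batch(1, 2, 3, 4);
        // The reader side only calls accept(); the filtering rule stays with the caller.
        options.filter.accept(batch);
        System.out.println(batch.size + " rows selected");
    }
}
```

Swapping keepEven for any other method reference or lambda changes the 
filtering rule without touching the stand-in reader code.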

 

The filter callback will have to fill the selected array and set the 
selectedInUse and size fields (all of which already exist) in the 
VectorizedRowBatch class. For instance, the example filter below keeps only the 
rows whose first column equals 0, which in this case is just the first row:
{code:java}
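public static void intFirstRowFilter(VectorizedRowBatch batch) {
  LongColumnVector col1 = (LongColumnVector) batch.cols[0];
  int newSize = 0;
  // Iterate over the rows actually present in the batch, not a fixed 1024
  for (int row = 0; row < batch.size; ++row) {
    // Pass only rows with a valid key (value 0 in the first column)
    if (col1.vector[row] == 0) {
      batch.selected[newSize++] = row;
    }
  }
  // Mark the selected array as in use and shrink the batch to the surviving rows
  batch.selectedInUse = true;
  batch.size = newSize;
}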
{code}
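
For completeness, this is how downstream code would typically walk a batch 
after such a filter has run: when selectedInUse is set, the first size entries 
of selected hold the surviving row indices. The Batch class below is a 
hypothetical stand-in for VectorizedRowBatch with a single long column, so the 
sketch runs without ORC on the classpath.

```java
final class SelectedRowsSketch {
    /** Hypothetical stand-in for VectorizedRowBatch with a single long column. */
    static final class Batch {
        long[] values;           // stand-in for LongColumnVector.vector
        int[] selected;          // indices of the rows that passed the filter
        boolean selectedInUse;   // whether `selected` should be consulted
        int size;                // number of (selected) rows
    }

    /** Sums only the rows that are logically part of the batch. */
    static long sum(Batch batch) {
        long total = 0;
        for (int i = 0; i < batch.size; ++i) {
            // Map the logical position to the physical row when a filter was applied.
            int row = batch.selectedInUse ? batch.selected[i] : i;
            total += batch.values[row];
        }
        return total;
    }

    public static void main(String[] args) {
        Batch batch = new Batch();
        batch.values = new long[] {5, 7, 9, 11};
        batch.selected = new int[] {1, 3, 0, 0};  // rows 1 and 3 survived the filter
        batch.selectedInUse = true;
        batch.size = 2;
        System.out.println(sum(batch));
    }
}
```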

The logic of the row-level filter is as follows [TreeReader 
Logic|https://github.com/apache/orc/blob/4e8572777234a46005df174748b7e49491107e85/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java#L2460]:
 
1) First, read the given columnName(s)
2) Evaluate the filter callback, filling the selected array and setting the 
size value of the VectorizedRowBatch
3) For the remaining columns of the VectorizedRowBatch, read only the selected 
rows to reduce the number of read/decoded rows, thus saving CPU cycles
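
The three steps above can be sketched as follows. This is a simplified model, 
not the TreeReaderFactory code: storage is a plain long[][] (an assumption made 
for illustration), Batch is a hypothetical stand-in for VectorizedRowBatch, and 
decodedValues exists only to make the saving visible. Real column decoders 
still have to skip over unselected values in the encoded streams, which this 
sketch ignores.

```java
import java.util.function.Consumer;

final class RowFilterFlowSketch {
    /** Hypothetical stand-in for VectorizedRowBatch holding only long columns. */
    static final class Batch {
        final long[][] cols;
        final int[] selected;
        boolean selectedInUse;
        int size;
        int decodedValues;       // bookkeeping for this sketch only

        Batch(int numCols, int rows) {
            this.cols = new long[numCols][rows];
            this.selected = new int[rows];
            this.size = rows;
        }
    }

    static Batch readBatch(long[][] stored, int filterCol, Consumer<Batch> filter) {
        Batch batch = new Batch(stored.length, stored[filterCol].length);

        // Step 1: read the filter column for every row.
        for (int row = 0; row < batch.size; ++row) {
            batch.cols[filterCol][row] = stored[filterCol][row];
            batch.decodedValues++;
        }

        // Step 2: let the callback fill `selected` and shrink `size`.
        filter.accept(batch);

        // Step 3: read the remaining columns for the selected rows only.
        for (int col = 0; col < stored.length; ++col) {
            if (col == filterCol) {
                continue;
            }
            for (int i = 0; i < batch.size; ++i) {
                int row = batch.selectedInUse ? batch.selected[i] : i;
                batch.cols[col][row] = stored[col][row];
                batch.decodedValues++;
            }
        }
        return batch;
    }

    /** Example callback: keep rows whose first column equals 0. */
    static void keepZero(Batch batch) {
        int newSize = 0;
        for (int row = 0; row < batch.size; ++row) {
            if (batch.cols[0][row] == 0) {
                batch.selected[newSize++] = row;
            }
        }
        batch.selectedInUse = true;
        batch.size = newSize;
    }

    public static void main(String[] args) {
        long[][] stored = {{0, 1, 2, 3}, {10, 11, 12, 13}, {20, 21, 22, 23}};
        Batch batch = readBatch(stored, 0, RowFilterFlowSketch::keepZero);
        // 4 values for the filter column plus 1 value for each of the other two columns.
        System.out.println(batch.decodedValues + " of 12 values decoded");
    }
}
```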

[~omalley] [~ashutoshc] [~gopalv] What do you think? Could you please take a 
look at the PR and share any comments/suggestions?




 

> Allow row-level filtering
> -------------------------
>
>                 Key: ORC-577
>                 URL: https://issues.apache.org/jira/browse/ORC-577
>             Project: ORC
>          Issue Type: New Feature
>            Reporter: Owen O'Malley
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, ORC filters at three levels:
>  * File level
>  * Stripe (64 to 256 MB) level
>  * Row group (10k row) level
> The filters are specified as Sargs (Search Arguments), which have a 
> relatively small vocabulary. Furthermore, they only filter sets of rows if 
> they can guarantee that none of the rows can pass the filter.
> There are some use cases where the user needs to read a subset of the columns 
> and apply more detailed row level filters. I'd suggest that we add a new 
> method in Reader.Options
> {{setFilter(String columnNames, Predicate<VectorizedRowBatch> filter)}}
> Where the columns named in columnNames are read expanded first, then the 
> filter is run and the rest of the data is read only if the predicate returns 
> true.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
