[ 
https://issues.apache.org/jira/browse/ORC-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041961#comment-17041961
 ] 

Panagiotis Garefalakis commented on ORC-577:
--------------------------------------------

After further benchmarking and profiling (see  
[ORC-597|https://issues.apache.org/jira/browse/ORC-597]) I realised that it is 
not worth skipping basic types such as Long, Short, Int, Date and Binary.  
It is more expensive to filter these types than to simply decode them all and 
propagate the selected array. Skipping for these types breaks CPU instruction 
pipelining and introduces more branch mispredictions.

In the latest version of the patch we simply propagate the selected array, which 
the consumer framework can use for these simple types without additional 
overhead.
For more complex types such as Decimal, Decimal64, Double, Float, Char, 
VarChar, String, Boolean and Timestamp, row-level filtering can give us a 
substantial benefit. 
For example, when projecting two Double columns with up to 90% of elements 
filtered out, runtime can drop to *half*, while filtering out 50% of the 
elements still gives us around a 10% benefit.
Check the attached logs for details on the rest of the types and their latency 
improvements.
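To illustrate the selected-array approach above, here is a minimal sketch of how a consumer can honor a propagated selected array. The `Batch` class is a hypothetical stand-in whose field names (`size`, `selected`, `selectedInUse`) mirror Hive's `VectorizedRowBatch`; it is not the actual ORC/Hive API, just an assumption-labeled illustration of the pattern: the reader decodes every value of a simple-typed column, and the consumer skips unselected rows by index.

{code:java}
// Hypothetical minimal stand-in mirroring VectorizedRowBatch's
// selected-array convention (not the real ORC/Hive classes).
public class SelectedArrayDemo {
    static class Batch {
        long[] col;            // decoded values for every row in the batch
        int size;              // number of valid entries
        int[] selected;        // indices of rows that passed the filter
        boolean selectedInUse; // true when 'selected' should be consulted
    }

    // Sum only the rows the filter kept: the reader decodes everything
    // (keeping the decode loop branch-free), and the consumer skips
    // filtered-out rows by index.
    static long sumSelected(Batch b) {
        long sum = 0;
        for (int i = 0; i < b.size; i++) {
            int row = b.selectedInUse ? b.selected[i] : i;
            sum += b.col[row];
        }
        return sum;
    }

    public static void main(String[] args) {
        Batch b = new Batch();
        b.col = new long[]{10, 20, 30, 40};
        b.selected = new int[]{1, 3}; // filter kept rows 1 and 3
        b.selectedInUse = true;
        b.size = 2;                   // two selected rows
        System.out.println(sumSelected(b)); // prints 60
    }
}
{code}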


{code:java}
Benchmark                                  (benchType)  (filterColsNum)  (filterPerc)  (version)  Mode  Cnt     Score  Error  Units
DoubleRowFilterBenchmark.readOrcNoFilter        DOUBLE                2          0.01   ORIGINAL  avgt    2  9565.173         ms/op
DoubleRowFilterBenchmark.readOrcNoFilter        DOUBLE                2           0.1   ORIGINAL  avgt    2  9366.666         ms/op
DoubleRowFilterBenchmark.readOrcNoFilter        DOUBLE                2           0.5   ORIGINAL  avgt    2  9299.646         ms/op
DoubleRowFilterBenchmark.readOrcRowFilter       DOUBLE                2          0.01   ORIGINAL  avgt    2  4479.986         ms/op
DoubleRowFilterBenchmark.readOrcRowFilter       DOUBLE                2           0.1   ORIGINAL  avgt    2  5520.556         ms/op
DoubleRowFilterBenchmark.readOrcRowFilter       DOUBLE                2           0.5   ORIGINAL  avgt    2  8758.157         ms/op
{code}




> Allow row-level filtering
> -------------------------
>
>                 Key: ORC-577
>                 URL: https://issues.apache.org/jira/browse/ORC-577
>             Project: ORC
>          Issue Type: New Feature
>            Reporter: Owen O'Malley
>            Assignee: Panagiotis Garefalakis
>            Priority: Major
>         Attachments: RowFilterBenchBoolean.out, RowFilterBenchDecimal.out, 
> RowFilterBenchDouble.out, RowFilterBenchString.out, 
> RowFilterBenchTimestamp.out
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, ORC filters at three levels:
>  * File level
>  * Stripe (64 to 256mb) level
>  * Row group (10k row) level
> The filters are specified as Sargs (Search Arguments), which have a 
> relatively small vocabulary. Furthermore, they only filter sets of rows if 
> they can guarantee that none of the rows can pass the filter.
> There are some use cases where the user needs to read a subset of the columns 
> and apply more detailed row level filters. I'd suggest that we add a new 
> method in Reader.Options
> {{setFilter(String columnNames, Predicate<VectorizedRowBatch> filter)}}
> Where the columns named in columnNames are read expanded first, then the 
> filter is run and the rest of the data is read only if the predicate returns 
> true.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
