[
https://issues.apache.org/jira/browse/ORC-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041961#comment-17041961
]
Panagiotis Garefalakis commented on ORC-577:
--------------------------------------------
After further benchmarking and profiling (see
[ORC-597|https://issues.apache.org/jira/browse/ORC-597]) I realised that it
is not worth skipping basic types such as Long, Short, Int, Date and Binary.
It is more expensive to filter these types than to simply decode all rows and
propagate the selected array. Skipping for these types breaks CPU instruction
pipelining and introduces more branch mispredictions.
In the latest version of the patch we simply propagate the selected array,
which the consumer framework can use for these simple types without additional
overhead.
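To make the selected-array pattern concrete, here is a rough plain-Java sketch. The {{BatchSketch}} class below is a hypothetical simplification, not the real batch class; its fields mirror the {{size}}, {{selected}} and {{selectedInUse}} bookkeeping that Hive's {{VectorizedRowBatch}} exposes, which is what a consumer would actually use:

```java
// Minimal sketch of the selected-array pattern (hypothetical class;
// field names mirror Hive's VectorizedRowBatch: size, selected,
// selectedInUse). All rows are decoded; the selected array only
// records which row indices survived the filter.
final class BatchSketch {
    int size;              // number of live rows in the batch
    int[] selected;        // indices of rows that passed the filter
    boolean selectedInUse; // if false, all rows 0..size-1 are live
    long[] col;            // one fully decoded column

    // Consumer-side aggregation that honours the selected array:
    // every row was decoded, but only selected rows are visited.
    long sumSelected() {
        long sum = 0;
        if (selectedInUse) {
            for (int i = 0; i < size; i++) {
                sum += col[selected[i]];
            }
        } else {
            for (int i = 0; i < size; i++) {
                sum += col[i];
            }
        }
        return sum;
    }
}
```

This is why the cheap types need no per-value skipping: decoding stays branch-free and the consumer pays only for the rows it actually visits.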
For more complex types such as Decimal, Decimal64, Double, Float, Char,
VarChar, String, Boolean and Timestamp, row-level filtering can give us a
substantial benefit.
For example, when projecting two Double columns with up to 90% of the elements
filtered out, runtime drops to *half*, while filtering out 50% of the elements
can still give us around a 10% benefit.
Check attached log for details about the rest of the types and the latency
improvements.
{code:java}
Benchmark                                  (benchType)  (filterColsNum)  (filterPerc)  (version)  Mode  Cnt     Score  Error  Units
DoubleRowFilterBenchmark.readOrcNoFilter        DOUBLE                2          0.01   ORIGINAL  avgt    2  9565.173         ms/op
DoubleRowFilterBenchmark.readOrcNoFilter        DOUBLE                2          0.1    ORIGINAL  avgt    2  9366.666         ms/op
DoubleRowFilterBenchmark.readOrcNoFilter        DOUBLE                2          0.5    ORIGINAL  avgt    2  9299.646         ms/op
DoubleRowFilterBenchmark.readOrcRowFilter       DOUBLE                2          0.01   ORIGINAL  avgt    2  4479.986         ms/op
DoubleRowFilterBenchmark.readOrcRowFilter       DOUBLE                2          0.1    ORIGINAL  avgt    2  5520.556         ms/op
DoubleRowFilterBenchmark.readOrcRowFilter       DOUBLE                2          0.5    ORIGINAL  avgt    2  8758.157         ms/op
{code}
> Allow row-level filtering
> -------------------------
>
> Key: ORC-577
> URL: https://issues.apache.org/jira/browse/ORC-577
> Project: ORC
> Issue Type: New Feature
> Reporter: Owen O'Malley
> Assignee: Panagiotis Garefalakis
> Priority: Major
> Attachments: RowFilterBenchBoolean.out, RowFilterBenchDecimal.out,
> RowFilterBenchDouble.out, RowFilterBenchString.out,
> RowFilterBenchTimestamp.out
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Currently, ORC filters at three levels:
> * File level
> * Stripe (64 to 256 MB) level
> * Row group (10k row) level
> The filters are specified as Sargs (Search Arguments), which have a
> relatively small vocabulary. Furthermore, they only filter sets of rows if
> they can guarantee that none of the rows can pass the filter.
> There are some use cases where the user needs to read a subset of the columns
> and apply more detailed row level filters. I'd suggest that we add a new
> method in Reader.Options
> {{setFilter(String columnNames, Predicate<VectorizedRowBatch> filter)}}
> Where the columns named in columnNames are read and expanded first, then the
> filter is run, and the rest of the data is read only if the predicate returns
> true.
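A usage sketch of the proposed API might look like the following. Note that {{Batch}} and {{Options}} here are simplified stand-ins, not the real ORC {{VectorizedRowBatch}} and {{Reader.Options}} classes; only the {{setFilter(String, Predicate<...>)}} shape is taken from the proposal above:

```java
import java.util.function.Predicate;

// Sketch of the proposed Reader.Options.setFilter API using stand-in
// types. Batch stands in for VectorizedRowBatch and Options for
// Reader.Options; the real classes are assumptions here.
final class FilterSketch {
    // Stand-in batch with a single decoded long column.
    static final class Batch {
        long[] col0;
    }

    // Stand-in for Reader.Options with the proposed setFilter method.
    static final class Options {
        String filterColumns;
        Predicate<Batch> filter;

        Options setFilter(String columnNames, Predicate<Batch> filter) {
            this.filterColumns = columnNames;
            this.filter = filter;
            return this;
        }
    }

    // Caller reads column "x" first, runs the predicate on each batch,
    // and decodes the remaining columns only when it returns true.
    static Options example() {
        return new Options().setFilter("x", batch -> batch.col0[0] > 100L);
    }
}
```

The key property is that the filter columns are materialised before the rest of the projection, so a batch rejected by the predicate never pays the decode cost for the other columns.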
--
This message was sent by Atlassian Jira
(v8.3.4#803005)