iChauster commented on PR #13366:
URL: https://github.com/apache/arrow/pull/13366#issuecomment-1163546329
@westonpace , I think this is ready for another round of review. I've
packaged some of the helpful benchmarking code into `benchmark_util.h`, and
have expanded our filter benchmarks. Following some of your notes, most of
these use the `FilterOverhead` version, which exercises the source, node, and
sink.
Regarding the multi-pass benchmarks, there are two parameters:
`null probability` (the probability that a given element in the batch is
null), and `bool_true_probability` (the probability that a given boolean in
the array is true).
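
To make the two parameters concrete, here is a minimal standalone sketch of how such a batch could be generated. It is not the code in `benchmark_util.h`: the helper name `MakeNullableBoolBatch`, the `seed` argument, and the use of `std::optional<bool>` in place of an Arrow boolean array are all assumptions made for illustration, and the probabilities are taken as fractions in [0, 1].

```cpp
#include <cstddef>
#include <optional>
#include <random>
#include <vector>

// Hypothetical helper (not part of benchmark_util.h): builds one batch of
// nullable booleans. std::nullopt stands in for a null slot.
std::vector<std::optional<bool>> MakeNullableBoolBatch(
    std::size_t batch_size, double null_probability,
    double bool_true_probability, unsigned seed = 42) {
  std::mt19937 rng(seed);
  std::bernoulli_distribution is_null(null_probability);
  std::bernoulli_distribution is_true(bool_true_probability);

  std::vector<std::optional<bool>> batch;
  batch.reserve(batch_size);
  for (std::size_t i = 0; i < batch_size; ++i) {
    if (is_null(rng)) {
      batch.push_back(std::nullopt);  // null slot
    } else {
      batch.push_back(is_true(rng));  // valid slot holding true or false
    }
  }
  return batch;
}
```

For example, `MakeNullableBoolBatch(100000, 0.5, 0.25)` would give a batch in which roughly half the slots are null and roughly a quarter of the valid slots are true.
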
Here are some of my notes:
- The expressions we are using seem to have far less impact on throughput
than they do for projections. All results are around the same magnitude, ~100M
rows/s.
- Interestingly, selectivity can cause some big performance differences,
possibly up to a 50x speedup, reaching 2G rows/s.
- Because of this, multi-pass filter operations may actually be advantageous
in some cases.

> FilterOverhead/not_null_to_is_true_multipass_benchmark/batch_size:100000/null_prob:100/bool_true_prob:25/real_time 400338 ns 56020 ns 1819 batches_per_second=24.9789k/s rows_per_second=2.49789G/s
>
> FilterOverhead/not_null_and_is_true_singlepass_benchmark/batch_size:100000/null_prob:100/bool_true_prob:25/real_time 1226684 ns 78326 ns 581 batches_per_second=8.15206k/s rows_per_second=815.206M/s

The multi-pass version is probably faster because the first pass (checking for
not null) really shrinks the table, making the second pass (checking for
truth) much quicker. One caveat is that the passes have to be in the correct
order. For example, if we instead check for truth first and then for not null,
we do not observe the speedup, and performance roughly matches single pass.