iChauster commented on PR #13366:
URL: https://github.com/apache/arrow/pull/13366#issuecomment-1163546329

   @westonpace, I think this is ready for another round of review. I've 
packaged some of the helpful benchmarking code into `benchmark_util.h` and 
expanded our filter benchmarks. Following your notes, most of these use the 
`FilterOverhead` version, which uses the source, the filter node, and the 
sink.
   
   Regarding the multi-pass benchmarks, there are two parameters: 
`null probability` (the probability that a given element in the batch is 
null) and `bool_true_probability` (the probability that a given boolean in 
the array is true). These appear as `null_prob` and `bool_true_prob` in the 
benchmark names below.
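   As a rough illustration of how two such parameters could shape a test 
column, here is a minimal Python sketch. The function name 
`make_nullable_bools` and the assumption that both probabilities apply 
independently per element are hypothetical; this is not the generator used in 
`benchmark_util.h`.

```python
import random
from typing import List, Optional


def make_nullable_bools(
    size: int,
    null_probability: float,
    bool_true_probability: float,
    seed: int = 0,
) -> List[Optional[bool]]:
    """Hypothetical generator: build a nullable boolean column.

    Each element is null (None) with probability `null_probability`;
    otherwise it is True with probability `bool_true_probability`.
    Both draws are assumed to be independent per element.
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    column: List[Optional[bool]] = []
    for _ in range(size):
        if rng.random() < null_probability:
            column.append(None)
        else:
            column.append(rng.random() < bool_true_probability)
    return column


# Example: a small column with many nulls and few true values.
column = make_nullable_bools(size=20, null_probability=0.5,
                             bool_true_probability=0.25, seed=42)
null_count = sum(1 for value in column if value is None)
true_count = sum(1 for value in column if value is True)
print(f"nulls={null_count} trues={true_count} of {len(column)}")
```

   Together, the two parameters control how many rows survive a "not null" 
check and how many of the remaining rows survive an "is true" check.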
   
   Here are some of my notes:
   - The expressions we use seem to have far less impact on throughput than 
they do for projections. All results are of the same order of magnitude, 
~100M rows/s.
   - Interestingly, selectivity can cause some big performance differences, 
possibly up to a 50x speedup, reaching 2G rows/s.
   - Because of this, multi-pass filter operations may actually be 
advantageous in some cases.
   > FilterOverhead/not_null_to_is_true_multipass_benchmark/batch_size:100000/null_prob:100/bool_true_prob:25/real_time 400338 ns 56020 ns 1819 batches_per_second=24.9789k/s rows_per_second=2.49789G/s

   > FilterOverhead/not_null_and_is_true_singlepass_benchmark/batch_size:100000/null_prob:100/bool_true_prob:25/real_time 1226684 ns 78326 ns 581 batches_per_second=8.15206k/s rows_per_second=815.206M/s

   This roughly 3x gap (2.50G vs. 815M rows/s) is probably because the first 
pass (checking for not null) shrinks the table substantially, making the 
second pass (checking for truth) much quicker. One caveat is that the passes 
have to be in the right order. For example, if we instead check for truth 
first and then for not null, we do not observe the speedup; performance 
roughly matches the single-pass version.
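
   One way to see why pass order can matter is to count how many rows each 
pass has to look at. The Python sketch below is a toy cost model only, not 
Arrow's execution engine: the two-column row layout, the helper names, and 
the assumption that a single-pass conjunction evaluates both predicates on 
every row are all hypothetical, and the model ignores the per-row cost of each 
kernel.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical row layout: (nullable value, boolean flag).
Row = Tuple[Optional[int], bool]
Predicate = Callable[[Row], bool]


def multipass_filter(rows: Sequence[Row],
                     predicates: Sequence[Predicate]) -> Tuple[List[Row], int]:
    """Apply one predicate per pass; each pass only sees earlier survivors."""
    evaluations = 0
    current = list(rows)
    for predicate in predicates:
        evaluations += len(current)  # this pass touches every surviving row
        current = [row for row in current if predicate(row)]
    return current, evaluations


def singlepass_filter(rows: Sequence[Row],
                      predicates: Sequence[Predicate]) -> Tuple[List[Row], int]:
    """Evaluate every predicate on every row, then AND the masks together."""
    evaluations = len(rows) * len(predicates)
    masks = [[predicate(row) for row in rows] for predicate in predicates]
    kept = [row for i, row in enumerate(rows) if all(mask[i] for mask in masks)]
    return kept, evaluations


def not_null(row: Row) -> bool:
    return row[0] is not None


def is_true(row: Row) -> bool:
    return row[1]


# 1000 rows: 1 in 100 has a non-null value, 1 in 4 has a true flag.
rows: List[Row] = [(i if i % 100 == 0 else None, i % 4 == 0)
                   for i in range(1000)]

selective_first, cost_selective_first = multipass_filter(rows, [not_null, is_true])
selective_last, cost_selective_last = multipass_filter(rows, [is_true, not_null])
single, cost_single = singlepass_filter(rows, [not_null, is_true])

print(cost_selective_first, cost_selective_last, cost_single)
```

   All three variants keep the same rows; only the number of predicate 
evaluations differs, and it is lowest when the most selective predicate runs 
first.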


