[GitHub] [arrow-datafusion] e-dard commented on issue #1823: implement bitmap_distinct function using bitmap

GitBox Wed, 16 Feb 2022 03:04:06 -0800


e-dard commented on issue #1823:
URL: https://github.com/apache/arrow-datafusion/issues/1823#issuecomment-1041369032



   Hey @Ted-Jiang!
   
   Nice to see some of these ideas making their way into DataFusion! I 
developed some of them for IOx's Read Buffer back in 2020.
   
   At the time I chose `croaring-rs` for a couple of reasons:
   
   - performance: I did some benchmarking and it was faster than the pure Rust 
crate (sadly I can't find these benchmarks on my machine now).
   - reliability: `croaring-rs` wraps the officially maintained C/C++ version, 
which generally means it's a lower-risk choice.
   
   The TLDR of how I use bitmaps in the Read Buffer is as follows:
   
    - constant-time row identification for predicates that match `column op 
literal` (which is the vast majority for InfluxData's use-cases). When a user 
specifies one of these we already have a compressed bitmap of all matching rows 
available.
    - (very) late materialisation. After all predicates are applied to all 
columns in memory (generally only working on the compressed representations), 
the resulting bitmaps are combined appropriately (intersected/unioned etc). Only 
then does the Read Buffer begin materialising rows into output record batches 
based on the ordinal offsets in the final bitmap.
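
A minimal sketch of the two points above, not the Read Buffer's actual code. To stay self-contained it uses a `BTreeSet<u32>` as a stand-in for a compressed roaring bitmap, and it only covers equality predicates. The names `IndexedColumn`, `rows_eq`, `materialise` and `demo` are hypothetical and exist only for illustration.

```rust
use std::collections::{BTreeSet, HashMap};

/// Stand-in for a compressed bitmap of row ids (assumption: a real
/// implementation would use a roaring bitmap instead of a BTreeSet).
type RowSet = BTreeSet<u32>;

/// Hypothetical column that keeps one row-id set per distinct value.
struct IndexedColumn {
    values: Vec<String>,
    index: HashMap<String, RowSet>,
    empty: RowSet,
}

impl IndexedColumn {
    fn new(values: &[&str]) -> Self {
        let mut index: HashMap<String, RowSet> = HashMap::new();
        for (row, v) in values.iter().enumerate() {
            index.entry(v.to_string()).or_default().insert(row as u32);
        }
        IndexedColumn {
            values: values.iter().map(|v| v.to_string()).collect(),
            index,
            empty: RowSet::new(),
        }
    }

    /// Rows matching `column = literal`: one hash lookup, no scan over rows.
    /// The set already exists, so only a reference is handed back.
    fn rows_eq(&self, literal: &str) -> &RowSet {
        self.index.get(literal).unwrap_or(&self.empty)
    }
}

/// Late materialisation: column values are read only for the rows that
/// survived every predicate.
fn materialise(columns: &[&IndexedColumn], rows: &RowSet) -> Vec<Vec<String>> {
    rows.iter()
        .map(|&r| columns.iter().map(|c| c.values[r as usize].clone()).collect())
        .collect()
}

/// Hypothetical query: SELECT region, host, usage WHERE region = 'west' AND host = 'a'.
fn demo() -> Vec<Vec<String>> {
    let region = IndexedColumn::new(&["west", "east", "west", "west"]);
    let host = IndexedColumn::new(&["a", "a", "b", "a"]);
    let usage = IndexedColumn::new(&["10", "20", "30", "40"]);

    // AND is an intersection of the per-predicate row sets; OR would be `union`.
    let matching: RowSet = region
        .rows_eq("west")
        .intersection(host.rows_eq("a"))
        .copied()
        .collect();

    materialise(&[&region, &host, &usage], &matching)
}
```

In this sketch `usage` is never touched until the final row set is known, which is the point of combining the row sets first and materialising last.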


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

