e-dard commented on issue #1823:
URL:
https://github.com/apache/arrow-datafusion/issues/1823#issuecomment-1041369032
Hey @Ted-Jiang!
Nice to see some of these ideas making their way into DataFusion! I
developed some of these ideas for IOx's Read Buffer back in 2020.
At the time I chose `croaring-rs` for a couple of reasons:
- performance: I did some benchmarking and it was faster than the pure Rust
crate (sadly I can't find these benchmarks on my machine now).
- reliability: `croaring-rs` wraps the officially maintained C/C++ version,
which generally means it's a lower-risk choice.
The TLDR of how I use bitmaps in the Read Buffer is as follows:
- constant-time row identification for predicates that match
`column op literal` (which is the vast majority for InfluxData's use-cases). When a user
specifies one of these we already have a compressed bitmap of all matching rows
available.
- (very) late materialisation. After all predicates are applied to all
columns in memory (generally only working on the compressed representations)
then the bitsets are combined appropriately (intersected/unioned etc). Only
then does the Read Buffer begin materialising rows into output record batches
based on the ordinal offsets in the final bitmap.
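
As a rough illustration of the two points above, here is a minimal sketch in Rust. It is not the Read Buffer's code and does not use the `croaring-rs` API: `RowBitmap` is a plain uncompressed bitset standing in for a Roaring bitmap, and `IndexedColumn`, `rows_equal_to` and `materialise` are hypothetical names made up for the example. It only covers the `column = literal` case, and it stores column values uncompressed for simplicity.

```rust
use std::collections::HashMap;

/// Stand-in for a compressed Roaring bitmap (assumption: a real
/// implementation would use a Roaring bitmap, not a raw bitset).
#[derive(Clone, Debug, Default, PartialEq)]
struct RowBitmap {
    words: Vec<u64>,
}

impl RowBitmap {
    fn insert(&mut self, row: usize) {
        let word = row / 64;
        if word >= self.words.len() {
            self.words.resize(word + 1, 0);
        }
        self.words[word] |= 1u64 << (row % 64);
    }

    /// Intersection: rows present in both bitmaps.
    fn and(&self, other: &RowBitmap) -> RowBitmap {
        let words = self.words.iter().zip(&other.words).map(|(a, b)| a & b).collect();
        RowBitmap { words }
    }

    /// Union: rows present in either bitmap.
    fn or(&self, other: &RowBitmap) -> RowBitmap {
        let (long, short) = if self.words.len() >= other.words.len() {
            (self, other)
        } else {
            (other, self)
        };
        let mut words = long.words.clone();
        for (w, s) in words.iter_mut().zip(&short.words) {
            *w |= *s;
        }
        RowBitmap { words }
    }

    /// Ordinal row offsets of the set bits, in ascending order.
    fn rows(&self) -> Vec<usize> {
        let mut out = Vec::new();
        for (i, &word) in self.words.iter().enumerate() {
            let mut bits = word;
            while bits != 0 {
                out.push(i * 64 + bits.trailing_zeros() as usize);
                bits &= bits - 1; // clear the lowest set bit
            }
        }
        out
    }
}

/// Hypothetical column that keeps one bitmap of matching rows per distinct value.
struct IndexedColumn {
    values: Vec<String>,
    index: HashMap<String, RowBitmap>,
}

impl IndexedColumn {
    fn new(values: &[&str]) -> Self {
        let mut index: HashMap<String, RowBitmap> = HashMap::new();
        for (row, v) in values.iter().enumerate() {
            index.entry(v.to_string()).or_default().insert(row);
        }
        IndexedColumn {
            values: values.iter().map(|v| v.to_string()).collect(),
            index,
        }
    }

    /// `column = literal`: a hash lookup that returns the prebuilt bitmap,
    /// with no scan over the rows. `None` means no row matches.
    fn rows_equal_to(&self, literal: &str) -> Option<&RowBitmap> {
        self.index.get(literal)
    }
}

/// Late materialisation: values are copied out only for rows set in the
/// final, already-combined bitmap.
fn materialise(columns: &[&IndexedColumn], selected: &RowBitmap) -> Vec<Vec<String>> {
    selected
        .rows()
        .into_iter()
        .map(|row| columns.iter().map(|c| c.values[row].clone()).collect())
        .collect()
}
```

With a `region` and a `host` column, a query like `region = 'west' AND host = 'a'` would look up one bitmap per predicate, combine them with `and`, and only then call `materialise` on the result.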
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.