maartenbreddels commented on pull request #8621: URL: https://github.com/apache/arrow/pull/8621#issuecomment-725345905
The `std::vector<bool>` was a good idea, and indeed, because it stores one bit per element, the memory usage for Unicode isn't that heavy (most extreme case: `0x10FFFF` bits ≈ 140 kB, for a contiguous array implementation).

Benchmarks:

```
set:
TrimManyAscii_median    28346892 ns   28345125 ns   25   558.956MB/s   35.2794M items/s
TrimManyUtf8_median     28302644 ns   28294883 ns   25   559.949MB/s   35.3421M items/s

unordered_set:
TrimManyAscii_median    32017530 ns   32014024 ns   22   494.898MB/s   31.2363M items/s
TrimManyUtf8_median     (not run)

vector<bool>:
TrimManyAscii_median    14911543 ns   14910620 ns   47   1062.58MB/s   67.0663M items/s
TrimManyUtf8_median     16148001 ns   16146053 ns   44   981.273MB/s   61.9346M items/s

bitset<256>:
TrimManyAscii_median    14304925 ns   14304010 ns   49   1107.64MB/s   69.9105M items/s
```

`vector<bool>` is good enough, I think. The bitset is consistently faster (about 5%), but I'd rather have similar code for both solutions.
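For illustration, here is a minimal sketch of the `vector<bool>` lookup idea, not the code in this PR. It assumes the input has already been decoded into codepoints (`std::u32string`), and the names `MakeTrimTable`, `ShouldTrim`, and `Trim` are hypothetical. The table is sized to the largest trim codepoint plus one, so an ASCII-only trim set stays tiny and the worst case is the `0x10FFFF`-bit table mentioned above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: build a lookup table with one bit per codepoint.
// Assumption: the table is sized to (largest trim codepoint + 1).
std::vector<bool> MakeTrimTable(const std::u32string& trim_chars) {
  char32_t max_cp = 0;
  for (char32_t cp : trim_chars) {
    if (cp > max_cp) max_cp = cp;
  }
  const std::size_t size =
      trim_chars.empty() ? 0 : static_cast<std::size_t>(max_cp) + 1;
  std::vector<bool> table(size, false);
  for (char32_t cp : trim_chars) {
    table[cp] = true;  // mark this codepoint as "to be trimmed"
  }
  return table;
}

// A codepoint beyond the end of the table is never trimmed.
inline bool ShouldTrim(const std::vector<bool>& table, char32_t cp) {
  return cp < table.size() && table[cp];
}

// Strip trim characters from both ends of an already-decoded string.
std::u32string Trim(const std::u32string& input,
                    const std::vector<bool>& table) {
  std::size_t begin = 0;
  std::size_t end = input.size();
  while (begin < end && ShouldTrim(table, input[begin])) ++begin;
  while (end > begin && ShouldTrim(table, input[end - 1])) --end;
  return input.substr(begin, end - begin);
}
```

With a table built from `U" \t"`, calling `Trim(U"  \thello \t", table)` yields `U"hello"`. The same two functions serve ASCII and non-ASCII trim sets alike, which is the "similar code for both solutions" trade-off; a `std::bitset<256>` would only cover the single-byte case.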