Tushar7012 commented on issue #19961: URL: https://github.com/apache/datafusion/issues/19961#issuecomment-3793997705
Hi @Dandandan, I'd like to work on this issue! I've analyzed the codebase and have a proposed approach:

**Optimization Strategy:**

1. **Store the `u128` view in the Entry struct** - enables fast comparison without dereferencing
2. **Use `values.views()` instead of `values.iter()`** - direct access to the raw views
3. **Fast path for inline strings (≤12 bytes)** - compare the `u128` views directly (a single integer comparison)
4. **Prefix comparison for larger strings** - check the 4-byte prefix before doing a full byte comparison

**Expected Impact:**

- Inline strings: Near 100% faster comparison (`u128` vs byte-by-byte)
- Large strings: Faster rejection via the prefix check before dereferencing
- Trade-off: +16 bytes of memory per unique string in the hash table

**Files to modify:**

- `datafusion/physical-expr-common/src/binary_view_map.rs`

I have a working implementation ready. I will create a PR with benchmark results shortly.
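To make steps 3 and 4 concrete, here is a minimal, standalone sketch of the comparison logic. It is not the code in `binary_view_map.rs` and does not use the `arrow` crate: `make_view` and `views_equal` are hypothetical helper names. It assumes the Arrow view layout, where the low 4 bytes hold the length, followed by either up to 12 inline data bytes (zero-padded) or a 4-byte prefix, a buffer index, and an offset.

```rust
/// Hypothetical helper: pack `data` into a 16-byte view, assuming the Arrow
/// view layout (length in the low 4 bytes, little-endian).
fn make_view(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    if data.len() <= 12 {
        // Inline case: the data lives inside the view, zero-padded.
        bytes[4..4 + data.len()].copy_from_slice(data);
    } else {
        // Out-of-line case: 4-byte prefix, then buffer index and offset.
        bytes[4..8].copy_from_slice(&data[0..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

/// Hypothetical helper: compare two values by view first, and only call the
/// `load_*` closures (which stand in for dereferencing the data buffers)
/// when the views alone cannot decide.
fn views_equal<'a>(
    a: u128,
    b: u128,
    load_a: impl FnOnce() -> &'a [u8],
    load_b: impl FnOnce() -> &'a [u8],
) -> bool {
    // The low 64 bits hold the length plus the first 4 data bytes in both
    // layouts, so a mismatch here rejects without touching any buffer.
    if (a as u64) != (b as u64) {
        return false;
    }
    let len = a as u32;
    if len <= 12 {
        // Inline strings: the whole value is in the view.
        return a == b;
    }
    // Long strings: buffer index and offset may differ for equal values,
    // so fall back to a full byte comparison.
    load_a() == load_b()
}
```

Note that the full `u128` comparison is only valid for inline strings. For longer strings, two equal values can live at different buffer offsets, so only the length and prefix can be compared from the view itself.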
