Tushar7012 commented on issue #19961:
URL: https://github.com/apache/datafusion/issues/19961#issuecomment-3793997705

   Hi @Dandandan 
   
   I'd like to work on this issue! I've analyzed the codebase and have a 
proposed approach:
   
   **Optimization Strategy:**
   1. **Store `u128` view in Entry struct** - enables fast comparison without 
dereferencing
   2. **Use `values.views()` instead of `values.iter()`** - direct access to 
raw views
   3. **Fast path for inline strings (≤12 bytes)** - compare `u128` views 
directly (a single integer equality check, no buffer access)
   4. **Prefix comparison for larger strings** - check 4-byte prefix before 
full byte comparison
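
To make steps 3 and 4 concrete, here is a minimal, self-contained sketch of the comparison logic. It is not the code from `binary_view_map.rs`: `make_view` and `views_equal` are hypothetical helpers written for illustration, and the sketch assumes the Arrow view layout (little-endian: 4-byte length, then either up to 12 zero-padded inline bytes, or a 4-byte prefix followed by a buffer index and an offset).

```rust
/// Hypothetical helper: build an Arrow-style 16-byte view for `data`.
/// Assumed layout (little-endian): bytes 0..4 hold the length; values of
/// at most 12 bytes are stored inline (zero-padded) in bytes 4..16; longer
/// values store a 4-byte prefix, a buffer index, and an offset instead.
fn make_view(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let len = data.len() as u32;
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&len.to_le_bytes());
    if data.len() <= 12 {
        bytes[4..4 + data.len()].copy_from_slice(data);
    } else {
        bytes[4..8].copy_from_slice(&data[..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

/// Hypothetical helper: compare two values given their views and a way to
/// reach the full bytes (plain slices here, standing in for a buffer lookup).
fn views_equal(a: u128, a_bytes: &[u8], b: u128, b_bytes: &[u8]) -> bool {
    // The low 64 bits hold the length plus the first 4 bytes, so a mismatch
    // here rejects the pair without touching any data buffer.
    if a as u64 != b as u64 {
        return false;
    }
    let len = a as u32;
    if len <= 12 {
        // Inline values: the whole value lives in the view, so equality of
        // the two u128 values is equality of the strings. This relies on
        // the unused inline bytes being zero-padded.
        a == b
    } else {
        // Long values: the upper 64 bits are buffer index and offset, which
        // may differ for equal strings, so fall back to comparing the bytes.
        a_bytes == b_bytes
    }
}
```

In this sketch, only long values with matching length and prefix reach the full byte comparison; every other case is decided from the views alone.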
   
   **Expected Impact:**
   - Inline strings: comparison reduces to one `u128` equality check instead 
of a byte-by-byte comparison (speedup to be confirmed by benchmarks)
   - Large strings: Faster rejection via prefix check before dereferencing
   - Trade-off: +16 bytes memory per unique string in hash table
   
   **Files to modify:**
   - `datafusion/physical-expr-common/src/binary_view_map.rs`
   
   I have a working implementation ready. Will create a PR with benchmark 
results shortly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

