[PR] [Improvement] Optimize CharRange.hashCode() to reduce collision rate and improve calculation efficiency [commons-lang]

via GitHub Sun, 14 Dec 2025 03:29:48 -0800


IcoreE opened a new pull request, #1524:
URL: https://github.com/apache/commons-lang/pull/1524


   **1. Background**
   The CharRange class (package: org.apache.commons.lang3) is a core component 
of the CharSet utility, and its hashCode() method is frequently invoked in 
scenarios such as HashMap/HashSet storage. The current implementation of 
hashCode() (linear combination) has severe hash collision problems and 
suboptimal calculation efficiency, which affects the performance of upper-layer 
applications.
   
   ```
   //Current hashCode implementation
   @Override
   public int hashCode() {
       return 83 + start + 7 * end + (negated ? 1 : 0);
   }
   ```
   **2. Problem Description**
   The current hashCode() implementation has two critical issues:
   
   - **High collision rate**: The linear combination (83 + start + 7 * end + 
flag) easily causes hash value offset. For example:
   
   1. CharRange.isNotIn((char) 1, (char) 2) (start=1, end=2, negated=true) → 
hash = 83+1+14+1=99
   2.  CharRange.isIn((char) 2, (char) 2) (start=2, end=2, negated=false) → 
hash = 83+2+14=99
   
   These two logically distinct instances (equals() returns false) have the 
same hash code, leading to serious bucket conflicts in HashMap.
   -  **Low calculation efficiency:** Relies on multiple arithmetic operations 
(1 multiplication + 3 additions), which are slower than bitwise operations and 
fail to utilize the 32-bit space of int efficiently.
   
   ```
   //  hash collision cases
   @Test
   void testHashCodeCollision() {
       // case A：hash=99
       CharRange a1 = CharRange.isNotIn((char)1, (char)2); // 1,2,true → 
83+1+14+1=99
       CharRange a2 = CharRange.isIn((char)2, (char)2);    // 2,2,false → 
83+2+14+0=99
       assertEquals(a1.hashCode(), a2.hashCode()); // Collision
       // case B：hash=123
       CharRange b1 = CharRange.isIn((char)5, (char)5);    //5,5,false 
→83+5+35+0=123
       CharRange b2 = CharRange.isNotIn((char)4, (char)5); //4,5,true 
→83+4+35+1=123
       assertEquals(b1.hashCode(),b2.hashCode()); //Collision
   }  
   ```
   **3. Proposed Solution**
   Replace the current linear combination implementation with a bitwise 
splicing + XOR flag scheme, which maximizes the use of 32-bit int space, 
minimizes collision rate, and optimizes calculation efficiency:
   
   ```
   @Override
   public int hashCode() {
       // Splice 16-bit start (high) and 16-bit end (low) into 32-bit int (no 
overflow)
       final int result = (start << 16) | (end & 0xFFFF);
       // Integrate negated flag via XOR to further disperse hash values
       return result ^ (negated ? 0x00010000 : 0);
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] [Improvement] Optimize CharRange.hashCode() to reduce collision rate and improve calculation efficiency [commons-lang]

Reply via email to