[ 
https://issues.apache.org/jira/browse/LANG-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongxin Yan updated LANG-1802:
-------------------------------
    Description: 
*1. Background*

The CharRange class (package: org.apache.commons.lang3) is a core component of 
the CharSet utility, and its hashCode() method is frequently invoked in 
scenarios such as HashMap/HashSet storage. The current implementation of 
hashCode() (linear combination) has severe hash collision problems and 
suboptimal calculation efficiency, which affects the performance of upper-layer 
applications.
{code:java}
//Current hashCode implementation
@Override
public int hashCode() {
    return 83 + start + 7 * end + (negated ? 1 : 0);
}{code}
h5. *2. Problem Description*

The current {{hashCode()}} implementation has two critical issues:
 * {*}High collision rate{*}: The linear combination ({{{}83 + start + 7 * end 
+ flag{}}}) easily causes hash value offset. For example:
 ** {{CharRange.isNotIn((char) 1, (char) 2)}} (start=1, end=2, negated=true) → 
hash = 83+1+14+1=99
 ** {{CharRange.isIn((char) 2, (char) 2)}} (start=2, end=2, negated=false) → 
hash = 83+2+14=99
 These two logically distinct instances (equals() returns false) have the same 
hash code, leading to serious bucket conflicts in {{{}HashMap{}}}.
 * {*}Low calculation efficiency{*}: Relies on multiple arithmetic operations 
(1 multiplication + 3 additions), which are slower than bitwise operations and 
fail to utilize the 32-bit space of {{int}} efficiently.

{code:java}
//  hash collision cases
@Test
void testHashCodeCollision() {
    // case A:hash=99
    CharRange a1 = CharRange.isNotIn((char)1, (char)2); // 1,2,true → 
83+1+14+1=99
    CharRange a2 = CharRange.isIn((char)2, (char)2);    // 2,2,false → 
83+2+14+0=99
    assertEquals(a1.hashCode(), a2.hashCode()); // Collision
    // case B:hash=123
    CharRange b1 = CharRange.isIn((char)5, (char)5);    //5,5,false 
→83+5+35+0=123
    CharRange b2 = CharRange.isNotIn((char)4, (char)5); //4,5,true 
→83+4+35+1=123
    assertEquals(b1.hashCode(),b2.hashCode()); //Collision
} {code}
*3. Proposed Solution*
Replace the current linear combination implementation with a bitwise splicing + 
XOR flag scheme, which maximizes the use of 32-bit int space, minimizes 
collision rate, and optimizes calculation efficiency:[Github 
PR|https://github.com/apache/commons-lang/pull/1524] [~ggregory] 

  was:
*1. Background*

The CharRange class (package: org.apache.commons.lang3) is a core component of 
the CharSet utility, and its hashCode() method is frequently invoked in 
scenarios such as HashMap/HashSet storage. The current implementation of 
hashCode() (linear combination) has severe hash collision problems and 
suboptimal calculation efficiency, which affects the performance of upper-layer 
applications.
{code:java}
//Current hashCode implementation
@Override
public int hashCode() {
    return 83 + start + 7 * end + (negated ? 1 : 0);
}{code}
h5. *2. Problem Description*
The current {{hashCode()}} implementation has two critical issues:
 * {*}High collision rate{*}: The linear combination ({{{}83 + start + 7 * end 
+ flag{}}}) easily causes hash value offset. For example:
 ** {{CharRange.isNotIn((char) 1, (char) 2)}} (start=1, end=2, negated=true) → 
hash = 83+1+14+1=99
 ** {{CharRange.isIn((char) 2, (char) 2)}} (start=2, end=2, negated=false) → 
hash = 83+2+14=99
 These two logically distinct instances (equals() returns false) have the same 
hash code, leading to serious bucket conflicts in {{{}HashMap{}}}.
 * {*}Low calculation efficiency{*}: Relies on multiple arithmetic operations 
(1 multiplication + 3 additions), which are slower than bitwise operations and 
fail to utilize the 32-bit space of {{int}} efficiently.

{code:java}
//  hash collision cases
@Test
void testHashCodeCollision() {
    // case A:hash=99
    CharRange a1 = CharRange.isNotIn((char)1, (char)2); // 1,2,true → 
83+1+14+1=99
    CharRange a2 = CharRange.isIn((char)2, (char)2);    // 2,2,false → 
83+2+14+0=99
    assertEquals(a1.hashCode(), a2.hashCode()); // Collision
    // case B:hash=123
    CharRange b1 = CharRange.isIn((char)5, (char)5);    //5,5,false 
→83+5+35+0=123
    CharRange b2 = CharRange.isNotIn((char)4, (char)5); //4,5,true 
→83+4+35+1=123
    assertEquals(b1.hashCode(),b2.hashCode()); //Collision
} {code}

*3. Proposed Solution*
Replace the current linear combination implementation with a bitwise splicing + 
XOR flag scheme, which maximizes the use of 32-bit int space, minimizes 
collision rate, and optimizes calculation efficiency:
[github PR|[https://github.com/apache/commons-lang/pull/1524] 
|https://github.com/apache/commons-lang/pull/1524]] [~ggregory] 


> [Improvement] Optimize CharRange.hashCode() to reduce collision rate and 
> improve calculation efficiency
> -------------------------------------------------------------------------------------------------------
>
>                 Key: LANG-1802
>                 URL: https://issues.apache.org/jira/browse/LANG-1802
>             Project: Commons Lang
>          Issue Type: Improvement
>          Components: lang.*
>    Affects Versions: 3.20.0
>            Reporter: Zhongxin Yan
>            Priority: Major
>
> *1. Background*
> The CharRange class (package: org.apache.commons.lang3) is a core component 
> of the CharSet utility, and its hashCode() method is frequently invoked in 
> scenarios such as HashMap/HashSet storage. The current implementation of 
> hashCode() (linear combination) has severe hash collision problems and 
> suboptimal calculation efficiency, which affects the performance of 
> upper-layer applications.
> {code:java}
> //Current hashCode implementation
> @Override
> public int hashCode() {
>     return 83 + start + 7 * end + (negated ? 1 : 0);
> }{code}
> h5. *2. Problem Description*
> The current {{hashCode()}} implementation has two critical issues:
>  * {*}High collision rate{*}: The linear combination ({{{}83 + start + 7 * 
> end + flag{}}}) easily causes hash value offset. For example:
>  ** {{CharRange.isNotIn((char) 1, (char) 2)}} (start=1, end=2, negated=true) 
> → hash = 83+1+14+1=99
>  ** {{CharRange.isIn((char) 2, (char) 2)}} (start=2, end=2, negated=false) → 
> hash = 83+2+14=99
>  These two logically distinct instances (equals() returns false) have the 
> same hash code, leading to serious bucket conflicts in {{{}HashMap{}}}.
>  * {*}Low calculation efficiency{*}: Relies on multiple arithmetic operations 
> (1 multiplication + 3 additions), which are slower than bitwise operations 
> and fail to utilize the 32-bit space of {{int}} efficiently.
> {code:java}
> //  hash collision cases
> @Test
> void testHashCodeCollision() {
>     // case A:hash=99
>     CharRange a1 = CharRange.isNotIn((char)1, (char)2); // 1,2,true → 
> 83+1+14+1=99
>     CharRange a2 = CharRange.isIn((char)2, (char)2);    // 2,2,false → 
> 83+2+14+0=99
>     assertEquals(a1.hashCode(), a2.hashCode()); // Collision
>     // case B:hash=123
>     CharRange b1 = CharRange.isIn((char)5, (char)5);    //5,5,false 
> →83+5+35+0=123
>     CharRange b2 = CharRange.isNotIn((char)4, (char)5); //4,5,true 
> →83+4+35+1=123
>     assertEquals(b1.hashCode(),b2.hashCode()); //Collision
> } {code}
> *3. Proposed Solution*
> Replace the current linear combination implementation with a bitwise splicing 
> + XOR flag scheme, which maximizes the use of 32-bit int space, minimizes 
> collision rate, and optimizes calculation efficiency:[Github 
> PR|https://github.com/apache/commons-lang/pull/1524] [~ggregory] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to