[
https://issues.apache.org/jira/browse/LANG-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gary D. Gregory resolved LANG-1802.
-----------------------------------
Fix Version/s: 3.20.1
Resolution: Fixed
Hello [~zhongxin]
Thank you for the report and pull request.
This PR does NOT include a unit test!
This is a good find, but the JRE already provides a solution. Please see git
master.
> Fix collision in CharRange.hashCode()
> -------------------------------------
>
> Key: LANG-1802
> URL: https://issues.apache.org/jira/browse/LANG-1802
> Project: Commons Lang
> Issue Type: Improvement
> Components: lang.*
> Affects Versions: 3.20.0
> Reporter: Zhongxin Yan
> Priority: Major
> Fix For: 3.20.1
>
>
> *1. Background*
> The CharRange class (package: org.apache.commons.lang3) is a core component
> of the CharSet utility, and its hashCode() method is frequently invoked in
> scenarios such as HashMap/HashSet storage. The current implementation of
> hashCode() (linear combination) has severe hash collision problems and
> suboptimal calculation efficiency, which affects the performance of
> upper-layer applications.
> {code:java}
> //Current hashCode implementation
> @Override
> public int hashCode() {
> return 83 + start + 7 * end + (negated ? 1 : 0);
> }{code}
> h5. *2. Problem Description*
> The current {{hashCode()}} implementation has two critical issues:
> * {*}High collision rate{*}: The linear combination ({{{}83 + start + 7 *
> end + flag{}}}) easily causes hash value offset. For example:
> ** {{CharRange.isNotIn((char) 1, (char) 2)}} (start=1, end=2, negated=true)
> → hash = 83+1+14+1=99
> ** {{CharRange.isIn((char) 2, (char) 2)}} (start=2, end=2, negated=false) →
> hash = 83+2+14=99
> These two logically distinct instances (equals() returns false) have the
> same hash code, leading to serious bucket conflicts in {{{}HashMap{}}}.
> * {*}Low calculation efficiency{*}: Relies on multiple arithmetic operations
> (1 multiplication + 3 additions), which are slower than bitwise operations
> and fail to utilize the 32-bit space of {{int}} efficiently.
> {code:java}
> // hash collision cases
> @Test
> void testHashCodeCollision() {
> // case A:hash=99
> CharRange a1 = CharRange.isNotIn((char)1, (char)2); // 1,2,true →
> 83+1+14+1=99
> CharRange a2 = CharRange.isIn((char)2, (char)2); // 2,2,false →
> 83+2+14+0=99
> assertEquals(a1.hashCode(), a2.hashCode()); // Collision
> // case B:hash=123
> CharRange b1 = CharRange.isIn((char)5, (char)5); //5,5,false
> →83+5+35+0=123
> CharRange b2 = CharRange.isNotIn((char)4, (char)5); //4,5,true
> →83+4+35+1=123
> assertEquals(b1.hashCode(),b2.hashCode()); //Collision
> } {code}
> *3. Proposed Solution*
> Replace the current linear combination implementation with a bitwise splicing
> + XOR flag scheme, which maximizes the use of 32-bit int space, minimizes
> collision rate, and optimizes calculation efficiency:[Github
> PR|https://github.com/apache/commons-lang/pull/1524] [~ggregory]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)