Matthew Donoughe created MNG-8241:
-------------------------------------
Summary: ComparableVersion incorrectly handles Unicode non-BMP
characters
Key: MNG-8241
URL: https://issues.apache.org/jira/browse/MNG-8241
Project: Maven
Issue Type: Bug
Reporter: Matthew Donoughe
Java strings are (usually) Unicode, but a Java char is a 16-bit UTF-16 code
unit, not a full Unicode code point.
ComparableVersion makes heavy use of
[String.charAt|https://docs.oracle.com/en/java/javase/22/docs/api/java.base/java/lang/String.html#charAt(int)],
which returns surrogate values instead of Unicode code points whenever a
character lies outside the BMP and therefore takes more than 16 bits.
This leads to the following behavior:
{noformat}
java -jar
~/.m2/repository/org/apache/maven/maven-artifact/3.9.4/maven-artifact-3.9.4.jar
1 𝟤
Display parameters as parsed by Maven (in canonical form and as a list of
tokens) and comparison result:
1. 1 -> 1; tokens: [1]
1 > 𝟤
2. 𝟤 -> 𝟤; tokens: [𝟤]{noformat}
1 (DIGIT ONE) > 𝟤 (MATHEMATICAL SANS-SERIF DIGIT TWO) because ComparableVersion
sees 𝟤 as two invalid characters and treats it as text.
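The char-level view described above can be reproduced without Maven. The following is only a minimal sketch using standard JDK calls; the class and helper names (SurrogateDemo, digitsSeenByChar, digitsSeenByCodePoint) are made up for illustration and are not part of ComparableVersion, whose actual parsing logic may differ in detail.

```java
public class SurrogateDemo {
    // Hypothetical helper: counts digits the way a charAt-based loop sees them.
    static int digitsSeenByChar(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            // Each surrogate half is tested on its own and is never a digit.
            if (Character.isDigit(s.charAt(i))) {
                n++;
            }
        }
        return n;
    }

    // Hypothetical helper: counts digits by iterating over full code points.
    static int digitsSeenByCodePoint(String s) {
        return (int) s.codePoints().filter(cp -> Character.isDigit(cp)).count();
    }

    public static void main(String[] args) {
        // U+1D7E4 MATHEMATICAL SANS-SERIF DIGIT TWO, built from its code point.
        String two = new String(Character.toChars(0x1D7E4));

        System.out.println(two.length());                         // 2 chars (a surrogate pair)
        System.out.println(two.codePointCount(0, two.length()));  // 1 code point
        System.out.println(Character.isSurrogate(two.charAt(0))); // true
        System.out.println(digitsSeenByChar(two));                // 0
        System.out.println(digitsSeenByCodePoint(two));           // 1
        System.out.println(Character.digit(two.codePointAt(0), 10)); // 2
    }
}
```

With a char-based loop neither half of the pair is recognised as a digit, which is consistent with 𝟤 ending up as a text token.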
{noformat}
java -jar
~/.m2/repository/org/apache/maven/maven-artifact/3.9.4/maven-artifact-3.9.4.jar
0 𝟤
Display parameters as parsed by Maven (in canonical form and as a list of
tokens) and comparison result:
1. 0 -> ; tokens: []
0 < 𝟤
2. 𝟤 -> 𝟤; tokens: [𝟤]{noformat}
However, 0 (DIGIT ZERO) is still < 𝟤 (MATHEMATICAL SANS-SERIF DIGIT TWO), so
0 < 𝟤 < 1 holds in the same way that 0 < a < 1 does.
It's unclear whether this should be considered to be a bug or whether it's just
an undocumented limitation. String.charAt and String.length should be avoided
unless you can be sure the characters are all BMP (Basic Multilingual Plane).
I was initially worried that 𝟣𝟣𝟣𝟣𝟣 (MATHEMATICAL SANS-SERIF DIGIT ONE) > 22222
(DIGIT TWO) because "𝟣𝟣𝟣𝟣𝟣".length() is 10, greater than MAX_INTITEM_LENGTH, but
that code doesn't even get hit because String.charAt is producing effectively
"�����������". If the code is changed to identify non-BMP Nd class digits like
𝟣 as digits, then the code that determines the required size of the data type
needs to be updated to measure the length in code points instead of chars.
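As a rough illustration of measuring in code points rather than chars, here is a sketch that uses only standard JDK methods. The class and method names (CodePointDigits, repeatCodePoint, digitCount, toAsciiDigits) are hypothetical, and whether ComparableVersion should accept non-ASCII Nd digits at all is an open question, so this is not a proposed patch.

```java
public class CodePointDigits {
    // Hypothetical helper: builds a string of `count` copies of one code point.
    static String repeatCodePoint(int codePoint, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.appendCodePoint(codePoint);
        }
        return sb.toString();
    }

    // Number of digits measured in code points, not UTF-16 chars.
    static int digitCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // Maps every Unicode decimal digit to its ASCII equivalent.
    static String toAsciiDigits(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            int value = Character.digit(cp, 10); // -1 if cp is not a decimal digit
            if (value < 0) {
                throw new IllegalArgumentException("not a decimal digit: U+"
                        + Integer.toHexString(cp).toUpperCase());
            }
            sb.append((char) ('0' + value));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Five copies of U+1D7E3 MATHEMATICAL SANS-SERIF DIGIT ONE.
        String ones = repeatCodePoint(0x1D7E3, 5);

        System.out.println(ones.length());       // 10 chars
        System.out.println(digitCount(ones));    // 5 code points
        System.out.println(toAsciiDigits(ones)); // 11111
    }
}
```

A size check based on digitCount would treat 𝟣𝟣𝟣𝟣𝟣 as five digits, whereas one based on String.length() would see ten.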