Matthew Donoughe created MNG-8241:
-------------------------------------

             Summary: ComparableVersion incorrectly handles Unicode non-BMP 
characters
                 Key: MNG-8241
                 URL: https://issues.apache.org/jira/browse/MNG-8241
             Project: Maven
          Issue Type: Bug
            Reporter: Matthew Donoughe


Java strings are sequences of UTF-16 code units, and a Java char is a single 
16-bit code unit, so one char cannot represent a code point above U+FFFF. 
ComparableVersion makes heavy use of 
[String.charAt|https://docs.oracle.com/en/java/javase/22/docs/api/java.base/java/lang/String.html#charAt(int)],
 which returns a surrogate code unit instead of a Unicode code point whenever a 
character takes more than 16 bits.
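
A minimal sketch of the difference, using only standard java.lang.Character and String methods. The class and method names are hypothetical and are not part of Maven; the string literal is U+1D7E4 (MATHEMATICAL SANS-SERIF DIGIT TWO) written as its surrogate pair.

```java
class SurrogateDemo {
    // U+1D7E4 MATHEMATICAL SANS-SERIF DIGIT TWO as a UTF-16 surrogate pair.
    static final String SANS_SERIF_TWO = "\uD835\uDFE4";

    // What a charAt-based scanner sees at index 0: a lone high surrogate.
    static boolean firstCharIsDigit(String s) {
        return Character.isDigit(s.charAt(0));
    }

    // What a code-point-based scanner sees at index 0: the whole character.
    static boolean firstCodePointIsDigit(String s) {
        return Character.isDigit(s.codePointAt(0));
    }

    public static void main(String[] args) {
        String s = SANS_SERIF_TWO;
        System.out.println(s.length());                              // 2 code units
        System.out.println(Character.isHighSurrogate(s.charAt(0)));  // true
        System.out.println(firstCharIsDigit(s));                     // false
        System.out.println(Integer.toHexString(s.codePointAt(0)));   // 1d7e4
        System.out.println(firstCodePointIsDigit(s));                // true
        System.out.println(Character.digit(s.codePointAt(0), 10));   // 2
    }
}
```

The char overload of Character.isDigit only ever sees one half of the pair, which is why a charAt-based parser falls through to its non-digit branch.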

This leads to the following behavior:

 
{noformat}
java -jar 
~/.m2/repository/org/apache/maven/maven-artifact/3.9.4/maven-artifact-3.9.4.jar 
1 𝟤
Display parameters as parsed by Maven (in canonical form and as a list of 
tokens) and comparison result:
1. 1 -> 1; tokens: [1]
   1 > 𝟤
2. 𝟤 -> 𝟤; tokens: [𝟤]{noformat}
1 (DIGIT ONE) > 𝟤 (MATHEMATICAL SANS-SERIF DIGIT TWO) because ComparableVersion 
sees 𝟤 as two invalid characters and treats it as text.

 

 
{noformat}
java -jar 
~/.m2/repository/org/apache/maven/maven-artifact/3.9.4/maven-artifact-3.9.4.jar 
0 𝟤
Display parameters as parsed by Maven (in canonical form and as a list of 
tokens) and comparison result:
1. 0 -> ; tokens: []
   0 < 𝟤
2. 𝟤 -> 𝟤; tokens: [𝟤]{noformat}
However, 0 (DIGIT ZERO) is still < 𝟤 (MATHEMATICAL SANS-SERIF DIGIT TWO), so 
0 < 𝟤 < 1, the same way 0 < a < 1.

 

It's unclear whether this should be considered a bug or just an undocumented 
limitation. In general, String.charAt and String.length should be avoided 
unless you can be sure the characters are all BMP (Basic Multilingual Plane).
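
One way to act on that advice, sketched under the assumption that a parser wants to walk the string one character at a time: either verify up front that no surrogates are present, or advance by code point. The names below are hypothetical and do not come from ComparableVersion.

```java
class CodePointScan {
    // True when the string has no surrogate code units, so charAt and length are safe.
    static boolean isBmpOnly(String s) {
        return s.chars().noneMatch(c -> Character.isSurrogate((char) c));
    }

    // Counts decimal digits by walking code points instead of chars.
    static int countDigits(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isDigit(cp)) {
                count++;
            }
            i += Character.charCount(cp); // advance by 1 or 2 code units
        }
        return count;
    }

    public static void main(String[] args) {
        String plain = "1.2-beta";
        String mixed = "1.\uD835\uDFE4"; // "1." followed by U+1D7E4
        System.out.println(isBmpOnly(plain));   // true
        System.out.println(isBmpOnly(mixed));   // false
        System.out.println(countDigits(mixed)); // 2
    }
}
```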

I was initially worried that 𝟣𝟣𝟣𝟣𝟣 (MATHEMATICAL SANS-SERIF DIGIT ONE) > 22222 
(DIGIT TWO) because "𝟣𝟣𝟣𝟣𝟣".length() is 10, greater than MAX_INTITEM_LENGTH, 
but that code doesn't even get hit because String.charAt effectively yields ten 
lone surrogates, none of which is recognized as a digit. If the code is changed 
to identify non-BMP digits in the Nd (decimal digit) category, like 𝟣, as 
digits, then the code that determines the required size of the data type needs 
to be updated to measure the length in code points instead of chars.
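
A rough sketch of what measuring in code points could look like. This is not the actual ComparableVersion code: the class, the helper methods, and the value 9 for MAX_INTITEM_LENGTH are assumptions made for illustration.

```java
class CodePointLength {
    // Assumed stand-in for ComparableVersion's MAX_INTITEM_LENGTH; the value 9 is an assumption.
    static final int MAX_INTITEM_LENGTH = 9;

    // Length measured in Unicode code points rather than UTF-16 code units.
    static int lengthInCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Hypothetical helper: maps any Nd digits to ASCII so the result can be parsed.
    static String toAsciiDigits(String digits) {
        StringBuilder sb = new StringBuilder();
        digits.codePoints().forEach(cp -> {
            int value = Character.digit(cp, 10);
            if (value < 0) {
                throw new NumberFormatException("not a decimal digit: U+" + Integer.toHexString(cp));
            }
            sb.append((char) ('0' + value));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Five copies of U+1D7E3 MATHEMATICAL SANS-SERIF DIGIT ONE.
        String ones = "\uD835\uDFE3\uD835\uDFE3\uD835\uDFE3\uD835\uDFE3\uD835\uDFE3";
        System.out.println(ones.length());                                   // 10
        System.out.println(lengthInCodePoints(ones));                        // 5
        System.out.println(lengthInCodePoints(ones) <= MAX_INTITEM_LENGTH);  // true
        System.out.println(Integer.parseInt(toAsciiDigits(ones)));           // 11111
    }
}
```

With the length taken in code points, the five-digit value stays within the int-sized limit even though it occupies ten chars.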



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
