alhudz opened a new pull request, #1734:
URL: https://github.com/apache/commons-lang/pull/1734

   Repro: `splitByCharacterType("A" + boldA)`, where `boldA` is U+1D400 
MATHEMATICAL BOLD CAPITAL A.
   Expected: `["A𝐀"]`, one token, since `A` and the bold `A` are both 
upper-case letters.
   Actual: `["A", "𝐀"]`.
   Cause: the shared worker iterates one `char` at a time and calls 
`Character.getType(char)`, so each half of a surrogate pair reads as 
`SURROGATE` rather than the real category of the code point. Neighbours of the 
same type are therefore split, and in the `camelCase` path the `pos - 1` 
backtrack can land between the two halves of a pair. 
`splitByCharacterType("5" + boldFive)`, with `boldFive` a supplementary-plane 
digit such as U+1D7D3 MATHEMATICAL BOLD DIGIT FIVE, splits two decimal digits 
the same way.
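
   To make the cause concrete, here is a minimal sketch of the two 
classification calls side by side. The class and helper names 
(`SurrogateTypeDemo`, `typeOfChar`, `typeOfCodePoint`) are hypothetical and 
are not taken from the patch; only the `java.lang.Character` and 
`java.lang.String` calls are real JDK API.

```java
final class SurrogateTypeDemo {
    // U+1D400 MATHEMATICAL BOLD CAPITAL A; occupies two chars (a surrogate pair) in UTF-16.
    static final String BOLD_A = new String(Character.toChars(0x1D400));

    /** Category reported when only the single UTF-16 unit at index is classified. */
    static int typeOfChar(String s, int index) {
        return Character.getType(s.charAt(index));
    }

    /** Category reported when the whole code point starting at index is classified. */
    static int typeOfCodePoint(String s, int index) {
        return Character.getType(s.codePointAt(index));
    }

    public static void main(String[] args) {
        String s = "A" + BOLD_A;
        // Index 1 is the high surrogate of the bold A.
        System.out.println(typeOfChar(s, 1) == Character.SURROGATE);             // prints true
        System.out.println(typeOfCodePoint(s, 1) == Character.UPPERCASE_LETTER); // prints true
    }
}
```

   Classified per `char`, the bold `A` never compares equal to the plain `A` 
before it, which is why the token breaks at index 1.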
   Fix: iterate by code point with `Character.codePointAt`/`charCount` and 
classify the whole code point; the `camelCase` boundary backs up by 
`Character.charCount(Character.codePointBefore(...))`. BMP input is unchanged.
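
   For reviewers who want the shape of the change without opening the diff, 
the following is a rough sketch of a code-point-based splitter along the lines 
described above. It is an illustration under assumptions, not the patch itself: 
the class name `CodePointSplitSketch`, the `split` signature and the `List` 
return type are all hypothetical, and the real method returns `String[]` and 
handles `null` differently.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointSplitSketch {

    /** Splits str into runs of code points that share one Character.getType category. */
    static List<String> split(String str, boolean camelCase) {
        List<String> tokens = new ArrayList<>();
        if (str == null || str.isEmpty()) {
            return tokens; // assumption: empty list for null/empty input
        }
        int tokenStart = 0;
        int first = str.codePointAt(0);
        int currentType = Character.getType(first);
        int pos = Character.charCount(first);
        while (pos < str.length()) {
            int cp = str.codePointAt(pos);
            int type = Character.getType(cp); // classify the whole code point
            if (type != currentType) {
                if (camelCase && type == Character.LOWERCASE_LETTER
                        && currentType == Character.UPPERCASE_LETTER) {
                    // Back up over the entire preceding code point (1 or 2 chars),
                    // so the boundary never falls inside a surrogate pair.
                    int newTokenStart = pos - Character.charCount(str.codePointBefore(pos));
                    if (newTokenStart != tokenStart) {
                        tokens.add(str.substring(tokenStart, newTokenStart));
                        tokenStart = newTokenStart;
                    }
                } else {
                    tokens.add(str.substring(tokenStart, pos));
                    tokenStart = pos;
                }
                currentType = type;
            }
            pos += Character.charCount(cp); // advance by 1 or 2 chars
        }
        tokens.add(str.substring(tokenStart));
        return tokens;
    }
}
```

   With this sketch, `split("A" + boldA, false)` yields a single token, and 
BMP-only input such as `split("fooBar", true)` still yields `foo` and `Bar`, 
because `Character.charCount` is 1 for every BMP code point.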
   

