ikaronen-relex opened a new pull request, #494:
URL: https://github.com/apache/commons-text/pull/494

   I was looking for an efficient method to convert a string to camel case and 
came across CaseUtils.toCamelCase. While the method looked pretty close to what 
I wanted, I noticed that the implementation seemed somewhat ugly and 
inefficient and appeared to be buggy in some rare edge cases. Here's a patch to 
clean it up.
   
   This patch fixes two bugs, both of which can only occur in rare edge cases 
(if at all):
   
   - In one of the branches of the toCamelCase inner loop, Character.charCount 
was mistakenly called on the title-case version of the character instead of the 
original, but the result was still used to increment the index to the original 
input. If the two counts differed (which I doubt is possible in current 
Unicode, but could _in theory_ happen in some future Unicode revision), this 
could cause the following character to be skipped and/or an extra unpaired 
surrogate to be inserted into the output.
   
   - Due to incorrect indexing in the toDelimiterSet helper method, if a 
non-BMP character was used as a delimiter, the low surrogate half of the 
character would also be added to the delimiter set. This can only affect the 
output if the input string contains mispaired surrogates.
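
   To illustrate the second bug, here is a minimal sketch of the indexing 
pattern described above. It is not the actual commons-text source: the class 
and method names are made up for this example, and only the JDK `Character` 
API is used. The corrected variant also shows the rule behind the first fix, 
namely that the index must advance by the width of the code point that was 
read from the input, not by the width of a transformed copy of it.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical names; a sketch of the indexing pattern, not the library code.
class DelimiterSetSketch {

    // Problematic pattern: the index advances one char at a time, so a
    // surrogate pair is visited twice (once as the full code point, once
    // as its lone low surrogate).
    static Set<Integer> perCharIndex(char[] delimiters) {
        Set<Integer> set = new HashSet<>();
        for (int i = 0; i < delimiters.length; i++) {
            set.add(Character.codePointAt(delimiters, i));
        }
        return set;
    }

    // Corrected pattern: the index advances by the char count of the code
    // point that was just read from the array.
    static Set<Integer> perCodePoint(char[] delimiters) {
        Set<Integer> set = new HashSet<>();
        for (int i = 0; i < delimiters.length; ) {
            int codePoint = Character.codePointAt(delimiters, i);
            set.add(codePoint);
            i += Character.charCount(codePoint);
        }
        return set;
    }

    public static void main(String[] args) {
        // U+1F600 is outside the BMP and is encoded as two chars.
        char[] delimiters = Character.toChars(0x1F600);
        System.out.println(perCharIndex(delimiters)); // also contains the low surrogate
        System.out.println(perCodePoint(delimiters)); // contains only U+1F600
    }
}
```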
   
   The patch also cleans up the code and improves its performance in several 
ways:
   
   - Duplicate code to increment the input index in the toCamelCase inner loop 
is moved out of the `if` statement.
   - Ugly code to get the Unicode code point of the ASCII space character (32) 
is replaced with a constant.
   - If no custom delimiters are passed to toCamelCase, a constant 
single-element delimiter set is used to avoid building a new HashSet on each 
call.
   - Also, when only one custom delimiter character is used, the delimiter set 
is constructed using `Set.of(a, b)` (the space plus the custom delimiter), 
which is likely more efficient than using a HashSet.
   - The input string in toCamelCase is no longer first converted to all 
lowercase, but is processed as is, saving some time.
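
   The loop shape these cleanups lead to can be sketched roughly as follows. 
This is a simplified illustration, not the code in the patch: the class name, 
the constants and the exact capitalization rules are assumptions made for the 
example (in particular, lowercasing each code point inside the loop is my 
guess at how "processed as is" could work). `Set.of` requires Java 9 or later.

```java
import java.util.Set;

// Hypothetical sketch; names and capitalization rules are assumptions.
class CamelCaseLoopSketch {

    // Code point of the ASCII space, written as a constant.
    static final int SPACE = 32;

    // Shared immutable set, so no new set is built per call.
    static final Set<Integer> DEFAULT_DELIMITERS = Set.of(SPACE);

    static String toCamelCase(String input, boolean capitalizeFirst) {
        return toCamelCase(input, capitalizeFirst, DEFAULT_DELIMITERS);
    }

    static String toCamelCase(String input, boolean capitalizeFirst,
                              Set<Integer> delimiters) {
        StringBuilder out = new StringBuilder(input.length());
        boolean capitalizeNext = false;
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            if (delimiters.contains(codePoint)) {
                // Delimiters are dropped; the next letter starts a new word
                // unless nothing has been written yet.
                capitalizeNext = out.length() > 0;
            } else if (capitalizeNext || (out.length() == 0 && capitalizeFirst)) {
                out.appendCodePoint(Character.toTitleCase(codePoint));
                capitalizeNext = false;
            } else {
                // Lowercased per code point instead of lowercasing the
                // whole input up front.
                out.appendCodePoint(Character.toLowerCase(codePoint));
            }
            // Single increment, based on the original code point.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }
}
```

   With this sketch, `toCamelCase("to camel case", false)` yields 
`toCamelCase`, and passing `true` yields `ToCamelCase`.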
   
   The last item also introduces a minor change in (undocumented and 
untested) behavior:
   
   - Previously, using uppercase letters as custom delimiters would have no 
effect, while lowercase custom delimiters would match both their upper and 
lower case variants. Now custom delimiters are always matched case-sensitively.
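
   The difference can be shown with two tiny predicates. They only model the 
behavior described above and are not taken from the library: the names are 
hypothetical, and `Character.toLowerCase` stands in for lowercasing the whole 
input string, which is a simplification.

```java
// Hypothetical model of the old and new delimiter matching.
class DelimiterCaseSketch {

    // Old behavior as described: the input was lowercased before matching,
    // so an input code point is effectively compared in its lowercase form.
    static boolean matchesAfterLowercasing(int inputCodePoint, int delimiter) {
        return Character.toLowerCase(inputCodePoint) == delimiter;
    }

    // New behavior as described: plain case-sensitive comparison.
    static boolean matchesCaseSensitively(int inputCodePoint, int delimiter) {
        return inputCodePoint == delimiter;
    }

    public static void main(String[] args) {
        // Old: delimiter 'x' matches input 'X'; delimiter 'X' matches nothing.
        System.out.println(matchesAfterLowercasing('X', 'x'));
        System.out.println(matchesAfterLowercasing('X', 'X'));
        // New: a delimiter matches only the identical code point.
        System.out.println(matchesCaseSensitively('X', 'x'));
        System.out.println(matchesCaseSensitively('X', 'X'));
    }
}
```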
   
   I personally consider the new behavior preferable to the old one, both 
because it makes more sense and because it allows a more efficient 
implementation. It seems unlikely that anyone would actually use letters as 
custom delimiters for toCamelCase in practice, but I have nonetheless added a 
unit test for this edge case (as well as several others).

