ikaronen-relex opened a new pull request, #494: URL: https://github.com/apache/commons-text/pull/494
I was looking for an efficient method to convert a string to camel case and came across CaseUtils.toCamelCase. While the method looked pretty close to what I wanted, I noticed that the implementation seemed somewhat ugly and inefficient and appeared to be buggy in some rare edge cases. Here's a patch to clean it up.

This patch fixes two bugs, both of which can only occur in rare edge cases (if at all):

- In one of the branches of the toCamelCase inner loop, Character.charCount was mistakenly called on the title-case version of the character instead of the original, but the result was still used to increment the index to the original input. If the two counts differed (which I doubt is possible in current Unicode, but could _in theory_ happen in some future Unicode revision), this could cause the following character to be skipped and/or an extra unpaired surrogate to be inserted into the output.
- Due to incorrect indexing in the toDelimiterSet helper method, if a non-BMP character was used as a delimiter, the low surrogate half of the character would also be added to the delimiter set. This can only affect the output if the input string contains mispaired surrogates.

The patch also cleans up the code and improves its performance in several ways:

- Duplicate code to increment the input index in the toCamelCase inner loop is moved out of the `if` statement.
- Ugly code to get the Unicode code point of the ASCII space character (32) is replaced with a constant.
- If no custom delimiters are passed to toCamelCase, a constant single-element delimiter set is used to avoid building a new HashSet on each call.
- Also, when only one custom delimiter character is used, the delimiter set is constructed using `Set.of(a, b)`, which is likely more efficient than using a HashSet.
- The input string in toCamelCase is no longer first converted to all lowercase, but is processed as is, saving some time.
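The two indexing fixes described above can be illustrated with a minimal sketch. This is not the code from the patch: the class name `CodePointSketch` and both method bodies are hypothetical, written only to show the pattern of reading one code point and then advancing the index by `Character.charCount` of that same original code point.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not the actual CaseUtils implementation.
class CodePointSketch {

    // Code point of the ASCII space character, always treated as a delimiter here.
    private static final int SPACE = 32;

    // Builds a set of delimiter code points. The index advances by the number
    // of chars in the code point just read, so the low surrogate of a non-BMP
    // delimiter is never added as a separate entry.
    static Set<Integer> toDelimiterSet(char... delimiters) {
        Set<Integer> set = new HashSet<>();
        set.add(SPACE);
        if (delimiters != null) {
            int i = 0;
            while (i < delimiters.length) {
                int cp = Character.codePointAt(delimiters, i);
                set.add(cp);
                i += Character.charCount(cp);
            }
        }
        return set;
    }

    // Simplified camel-case conversion (assumption: delimiters are dropped,
    // the code point after a delimiter is title-cased, the rest lowercased).
    static String toCamelCase(String str, boolean capitalizeFirst, char... delimiters) {
        Set<Integer> delims = toDelimiterSet(delimiters);
        StringBuilder out = new StringBuilder(str.length());
        boolean capitalizeNext = capitalizeFirst;
        int i = 0;
        while (i < str.length()) {
            int cp = str.codePointAt(i);
            if (delims.contains(cp)) {
                // Leading delimiters do not change the capitalizeFirst choice.
                if (out.length() > 0) {
                    capitalizeNext = true;
                }
            } else if (capitalizeNext) {
                out.appendCodePoint(Character.toTitleCase(cp));
                capitalizeNext = false;
            } else {
                out.appendCodePoint(Character.toLowerCase(cp));
            }
            // Single increment, based on the ORIGINAL code point, not on the
            // title-cased or lowercased one.
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

In this sketch, the index increment sits once at the end of the loop body, which is also what makes it easy to see that it depends only on the input code point.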
The last change (no longer lowercasing the input before processing it) also introduces a minor change to (undocumented and untested) behavior:

- Previously, using uppercase letters as custom delimiters would have no effect, while lowercase custom delimiters would match both their upper and lower case variants. Now custom delimiters are always matched case-sensitively.

I would personally consider the new behavior preferable to the old behavior, both since it makes more sense and since it allows a more efficient implementation. Anyway, while it does not seem likely that anyone would ever use letters as custom delimiters for toCamelCase in practice, I have nonetheless added a unit test for this edge case (as well as several others).
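The difference in delimiter matching can be modeled with a small sketch. This is an assumption-laden illustration of the behavior described above, not code taken from CaseUtils: the class `DelimiterCaseSketch` and its two methods are hypothetical, and the "old" variant simply emulates lowercasing the input before the delimiter lookup.

```java
import java.util.Set;

// Hypothetical model of the described behavior; not the actual CaseUtils code.
class DelimiterCaseSketch {

    // Old behavior as described: the input was lowercased first, so the
    // delimiter lookup only ever saw lowercase input code points.
    static boolean matchesOld(int inputCodePoint, Set<Integer> delimiters) {
        return delimiters.contains(Character.toLowerCase(inputCodePoint));
    }

    // New behavior as described: the input code point is looked up as is,
    // so matching is case-sensitive.
    static boolean matchesNew(int inputCodePoint, Set<Integer> delimiters) {
        return delimiters.contains(inputCodePoint);
    }
}
```

Under this model, an uppercase delimiter such as `'X'` can never match in the old variant, because no uppercase code point survives the lowercasing, while a lowercase delimiter `'x'` matches both `'x'` and `'X'` in the input. In the new variant each delimiter matches only itself.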