Hi,

We discussed this issue on this mailing list [1] earlier this year.

I investigated the usage of these two methods and found that all use cases
within the JDK are suspicious, resulting in many hard-to-notice bugs.

I hope to create a PR for this issue that deprecates these two methods and
adds alternative methods for them. However, I have no experience making such
changes, so I may need some guidance, or someone more experienced could take
this on.

Glavo

[1] https://mail.openjdk.org/pipermail/core-libs-dev/2023-January/099375.html

On Sun, Apr 9, 2023 at 10:58 PM
<some-java-user-99206970363698485...@vodafonemail.de> wrote:

> Hello,
> could you please add String & Character ASCII case conversion methods,
> that is, methods which only perform case conversion on ASCII characters in
> the input and leave any other characters unchanged. The conversion should
> not depend on the default locale. For example:
> - String:
>   - toAsciiLowerCase
>   - toAsciiUpperCase
>   - equalsAsciiIgnoreCase (or a better name)
>   - compareToAsciiIgnoreCase (or a better name)
> - Character:
>   - toAsciiLowerCase
>   - toAsciiUpperCase
>
> This would give the following advantages:
> - Increased performance (+ not be vulnerable to denial of service attacks)
> - Reduced number of bugs in applications
>
> Please read on for a detailed explanation.
>
> I assume for historic reasons (Applets) the current case conversion
> methods use the Unicode conversion rules, and even worse
> String.toLowerCase() and String.toUpperCase() use the default locale. While
> this might back then have been a reasonable choice because Applets ran
> locally in the browser and localization was a nice to have feature (or even
> a requirement), nowadays Java is largely used for back-end systems and case
> conversion is pretty often done for technical strings and not display text
> anymore. In this context applications mostly process ASCII strings.
> However, because Java does not offer any specific case conversion methods
> for these cases, users still use the standard String & Character methods.
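To make the request above concrete, here is a minimal sketch of what ASCII-only helpers could look like. The class name AsciiCase and all method bodies are hypothetical illustrations, not an existing or proposed JDK API; only the method names are taken from the quoted list.

```java
/** Hypothetical helper for illustration only; not part of the JDK. */
final class AsciiCase {
    private AsciiCase() {}

    /** Lowercases 'A'..'Z'; every other char is returned unchanged. */
    static char toAsciiLowerCase(char c) {
        return (c >= 'A' && c <= 'Z') ? (char) (c + ('a' - 'A')) : c;
    }

    /** Uppercases 'a'..'z'; every other char is returned unchanged. */
    static char toAsciiUpperCase(char c) {
        return (c >= 'a' && c <= 'z') ? (char) (c - ('a' - 'A')) : c;
    }

    static String toAsciiLowerCase(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            chars[i] = toAsciiLowerCase(chars[i]);
        }
        return new String(chars);
    }

    static String toAsciiUpperCase(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            chars[i] = toAsciiUpperCase(chars[i]);
        }
        return new String(chars);
    }

    /** Compares two strings, ignoring case differences in ASCII letters only. */
    static boolean equalsAsciiIgnoreCase(String a, String b) {
        return a.length() == b.length() && compareToAsciiIgnoreCase(a, b) == 0;
    }

    /** Orders two strings char by char after ASCII-only lowercasing. */
    static int compareToAsciiIgnoreCase(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            int d = toAsciiLowerCase(a.charAt(i)) - toAsciiLowerCase(b.charAt(i));
            if (d != 0) {
                return d;
            }
        }
        return a.length() - b.length();
    }
}
```

Because each char maps to exactly one char, the result always has the same length as the input, the default locale plays no role, and non-ASCII characters such as U+212A (KELVIN SIGN) are left unchanged.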
> This causes the following problems [1]:
>
> - String.toLowerCase() & String.toUpperCase() using default locale
>   What this means is that depending on the OS locale your application
>   might behave differently or fail [2]. For the scale of this, simply look in
>   the OpenJDK database:
>   https://bugs.openjdk.org/issues/?jql=text ~ "turkish locale"
>   At this point you probably have to add a disclaimer to any Java program
>   that running it on systems with Turkish (and possibly others) as locale is
>   not supported, because either your own code or the libraries you are using
>   might be calling toLowerCase() or toUpperCase() [3].
>
> - Bad performance for Unicode aware case conversions
>   Compared to simply performing ASCII case conversion, applying Unicode
>   case conversion has worse performance. In some cases it can even have
>   extremely bad performance (JDK-8292573). This could have security
>   implications.
>
> - Bugs due to case conversion changing string length
>   Unicode case conversion for certain strings can change the length,
>   either increasing or decreasing the size of the string (or when combining
>   both, shifting position of characters in the string while keeping the
>   length the same). If an application assumes that the length of the string
>   remains the same and uses data derived from the original string (e.g.
>   character indices or length) on the converted string this can lead to
>   exceptions or potentially even security issues.
>
> - Unicode characters mapping to ASCII chars
>   When performing case conversion on certain non-ASCII Unicode characters,
>   the results are ASCII characters. For example
>   `Character.toLowerCase('\u212A') == 'k'`. This could have security
>   implications.
>
> - Update to Unicode data changing application behavior
>   Unicode evolves over time, and the JDK regularly updates the Unicode
>   data it is using.
>   Even if an application is not affected by the issues
>   mentioned above, it could become affected by them when the Unicode data is
>   updated in a newer JDK version.
>
> My main point here is that (I assume) in many cases Java applications
> don't need Unicode case conversion, let alone Unicode case conversion using
> the default locale. If Java offered ASCII-only case conversion methods,
> then hopefully users would (where applicable) switch to these methods over
> time and avoid all the issues mentioned above. And even if they
> accidentally use the ASCII-only methods for display text, the result might
> be a minor inconvenience for users seeing the display text, compared to in
> the other cases application bugs and security vulnerabilities.
>
> Related information about other programming languages:
> - Rust: Has dedicated methods for ASCII case conversion, e.g.
>   https://doc.rust-lang.org/std/string/struct.String.html#method.to_ascii_lowercase
> - Kotlin: Functions which implicitly use the default locale were
>   deprecated, see https://youtrack.jetbrains.com/issue/KT-43023
>
> Risks:
> - ASCII case conversion could lead to undesired results in some cases, see
>   the example for the word "café" on
>   https://doc.rust-lang.org/std/ascii/trait.AsciiExt.html (though that
>   specific example is about a display string, for which these ASCII-only
>   methods are not intended)
> - When applications start to mix ASCII-only and the existing Unicode
>   conversion methods this could lead to bugs and security issues as well;
>   though it might also indicate a flaw in the application if it performs case
>   conversion on the same value in different places
>
> I hope you consider this suggestion. Feedback is appreciated!
>
> Kind regards
>
> ----
>
> [1] I am not saying though that Java is the only affected language, it
> definitely affects others as well. But that should not prevent improving
> the Java API.
> [2] Tool for detecting usage of such methods:
> https://github.com/policeman-tools/forbidden-apis
> [3] Maybe it would also be worth discussing deprecating
> String.toLowerCase() and String.toUpperCase() because they seem to do more
> harm than good.
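
The problems listed in the quoted message can be reproduced with a few lines of Java. This is only an illustrative sketch: the class and helper method names are made up for the example, while the String, Character and Locale calls are the standard JDK ones.

```java
import java.util.Locale;

/** Illustrative demo class; the name and helpers are hypothetical. */
final class CaseConversionPitfalls {
    private CaseConversionPitfalls() {}

    /** Lowercases with Turkish rules, as toLowerCase() would on a Turkish-locale system. */
    static String lowerInTurkish(String s) {
        return s.toLowerCase(Locale.forLanguageTag("tr"));
    }

    /** Uppercases with locale-neutral Unicode rules. */
    static String upperInRoot(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Locale dependence: Turkish maps 'I' to dotless i (U+0131), so the
        // result is not the ASCII string "title".
        System.out.println(lowerInTurkish("TITLE").equals("title"));             // false
        System.out.println("TITLE".toLowerCase(Locale.ROOT).equals("title"));    // true

        // Length change: U+00DF (sharp s) uppercases to "SS", so a 6-char
        // input becomes a 7-char result and old indices no longer line up.
        System.out.println(upperInRoot("stra\u00DFe").length());                 // 7

        // Non-ASCII to ASCII: KELVIN SIGN (U+212A) lowercases to ASCII 'k'.
        System.out.println(Character.toLowerCase('\u212A') == 'k');              // true
    }
}
```

Passing an explicit locale such as Locale.ROOT removes the dependence on the default locale, but the length change and the Kelvin sign mapping remain, since both come from the Unicode case mappings themselves.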