Add String & Character ASCII case conversion methods

some-java-user-99206970363698485155 Sun, 09 Apr 2023 07:58:44 -0700

Hello,
could you please add String & Character ASCII case conversion methods, that is, 
methods which only perform case conversion on ASCII characters in the input and 
leave any other characters unchanged. The conversion should not depend on the 
default locale. For example:
- String:
  - toAsciiLowerCase
  - toAsciiUpperCase
  - equalsAsciiIgnoreCase (or a better name)
  - compareToAsciiIgnoreCase (or a better name)
- Character:
  - toAsciiLowerCase
  - toAsciiUpperCase
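
To make the list above more concrete, here is a minimal sketch of how such
methods might behave. `AsciiCase` is a hypothetical helper class and does not
exist in the JDK; the method names are the ones suggested above, and the exact
semantics (for example the ordering used by the compare method) are assumptions.

```java
// Hypothetical helper illustrating the proposed ASCII-only semantics.
// Nothing here is an existing JDK API.
final class AsciiCase {
    private AsciiCase() {}

    // Only 'A'..'Z' are changed; every other char is returned as-is.
    static char toAsciiLowerCase(char c) {
        return (c >= 'A' && c <= 'Z') ? (char) (c + ('a' - 'A')) : c;
    }

    // Only 'a'..'z' are changed; every other char is returned as-is.
    static char toAsciiUpperCase(char c) {
        return (c >= 'a' && c <= 'z') ? (char) (c - ('a' - 'A')) : c;
    }

    // Working per char is safe: surrogates never fall in the ASCII range.
    static String toAsciiLowerCase(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            chars[i] = toAsciiLowerCase(chars[i]);
        }
        return new String(chars);
    }

    static String toAsciiUpperCase(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            chars[i] = toAsciiUpperCase(chars[i]);
        }
        return new String(chars);
    }

    // The result always has the same length as the input, so a length
    // mismatch can be rejected up front.
    static boolean equalsAsciiIgnoreCase(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        for (int i = 0; i < a.length(); i++) {
            if (toAsciiLowerCase(a.charAt(i)) != toAsciiLowerCase(b.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Assumed ordering: compare char by char after ASCII lower-casing,
    // then by length.
    static int compareToAsciiIgnoreCase(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            char x = toAsciiLowerCase(a.charAt(i));
            char y = toAsciiLowerCase(b.charAt(i));
            if (x != y) {
                return x - y;
            }
        }
        return a.length() - b.length();
    }
}
```

With this sketch, `AsciiCase.equalsAsciiIgnoreCase("Content-Type", "content-type")`
is true regardless of the default locale, and non-ASCII characters are never
touched.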


This would give the following advantages:
- Increased performance (and less exposure to denial-of-service attacks)
- Reduced number of bugs in applications


Please read on for a detailed explanation.

I assume that for historic reasons (Applets) the current case conversion 
methods use the Unicode conversion rules and, even worse, String.toLowerCase() 
and String.toUpperCase() use the default locale. This might have been a 
reasonable choice back then, because Applets ran locally in the browser and 
localization was a nice-to-have feature (or even a requirement). Nowadays, 
however, Java is largely used for back-end systems, and case conversion is 
often applied to technical strings rather than display text. In this context 
applications mostly process ASCII strings.
However, because Java does not offer any specific case conversion methods for 
these cases, users still use the standard String & Character methods. This 
causes the following problems [1]:

- String.toLowerCase() & String.toUpperCase() using default locale
  What this means is that, depending on the OS locale, your application might 
behave differently or fail [2]. For the scale of this, simply search the 
OpenJDK bug database: https://bugs.openjdk.org/issues/?jql=text ~ "turkish locale"
  At this point you probably have to add a disclaimer to any Java program 
stating that running it on systems with a Turkish locale (and possibly others) 
is not supported, because either your own code or the libraries you are using 
might be calling toLowerCase() or toUpperCase() [3].

- Bad performance for Unicode aware case conversions
  Compared to simply performing ASCII case conversion, applying Unicode case 
conversion has worse performance. In some cases it can even have extremely bad 
performance (JDK-8292573). This could have security implications, such as 
denial of service.

- Bugs due to case conversion changing string length
  Unicode case conversion for certain strings can change the length, either 
increasing or decreasing the size of the string (or when combining both, 
shifting position of characters in the string while keeping the length the 
same). If an application assumes that the length of the string remains the same 
and uses data derived from the original string (e.g. character indices or 
length) on the converted string this can lead to exceptions or potentially even 
security issues.

- Unicode characters mapping to ASCII chars
  When performing case conversion on certain non-ASCII Unicode characters, the 
results are ASCII characters. For example `Character.toLowerCase('\u212A') == 
'k'`. This could have security implications.

- Update to Unicode data changing application behavior
  Unicode evolves over time, and the JDK regularly updates the Unicode data it 
is using. Even if an application is not affected by the issues mentioned above, 
it could become affected by them when the Unicode data is updated in a newer 
JDK version.
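
Three of the problems above (default locale, length change, and non-ASCII
characters mapping to ASCII) can be reproduced with a few lines. This is only
an illustrative sketch; the class name `UnicodeCasePitfalls` and its helper
methods are made up for this example, and the locale is passed explicitly so
that the result does not depend on the machine it runs on.

```java
import java.util.Locale;

// Illustrative demo class; the name and helpers are hypothetical.
final class UnicodeCasePitfalls {
    private UnicodeCasePitfalls() {}

    // Locale-sensitive: with a Turkish locale, 'I' lowercases to
    // dotless i (U+0131) instead of 'i'.
    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    // Length-changing: sharp s (U+00DF) uppercases to "SS".
    static String upper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    // Non-ASCII matching ASCII: KELVIN SIGN (U+212A) lowercases to 'k'.
    static boolean unicodeEqualsIgnoreCase(String a, String b) {
        return a.equalsIgnoreCase(b);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        System.out.println(lower("TITLE", Locale.ROOT).equals("title")); // true
        System.out.println(lower("TITLE", turkish).equals("title"));     // false

        String street = "stra\u00DFe";          // 6 chars
        String upperStreet = upper(street);     // "STRASSE", 7 chars
        int indexOfE = street.indexOf('e');     // 5
        // An index taken from the original no longer points at 'E'.
        System.out.println(upperStreet.charAt(indexOfE));                // S

        System.out.println(unicodeEqualsIgnoreCase("\u212A", "k"));      // true
    }
}
```

An ASCII-only conversion would give "title" in every locale, keep the length
at 6, and treat U+212A as different from 'k'.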

My main point here is that (I assume) in many cases Java applications don't 
need Unicode case conversion, let alone Unicode case conversion using the 
default locale. If Java offered ASCII-only case conversion methods, then 
hopefully users would (where applicable) switch to these methods over time and 
avoid all the issues mentioned above. And even if they accidentally used the 
ASCII-only methods for display text, the result would at worst be a minor 
inconvenience for users reading that text, whereas the current methods can 
cause application bugs and security vulnerabilities.

Related information about other programming languages:
- Rust: Has dedicated methods for ASCII case conversion, e.g. 
https://doc.rust-lang.org/std/string/struct.String.html#method.to_ascii_lowercase
- Kotlin: Functions which implicitly use the default locale were deprecated, 
see https://youtrack.jetbrains.com/issue/KT-43023

Risks:
- ASCII case conversion could lead to undesired results in some cases, see the 
example for the word "café" on 
https://doc.rust-lang.org/std/ascii/trait.AsciiExt.html (though that specific 
example is about a display string, for which these ASCII-only methods are not 
intended)
- When applications start to mix ASCII-only and the existing Unicode conversion 
methods this could lead to bugs and security issues as well; though it might 
also indicate a flaw in the application if it performs case conversion on the 
same value in different places

I hope you consider this suggestion. Feedback is appreciated!

Kind regards

----

[1] I am not saying that Java is the only affected language; this definitely 
affects others as well. But that should not prevent improving the Java API.
[2] Tool for detecting usage of such methods: 
https://github.com/policeman-tools/forbidden-apis
[3] Maybe it would also be worth discussing deprecating String.toLowerCase() 
and String.toUpperCase() because they seem to do more harm than good.
