Re: RFR: 8195686: ISO-8859-8-i charset cannot be decoded, should be mapped to ISO-8859-8

Steven Loomis Wed, 09 Oct 2024 13:29:18 -0700

On Thu, 3 Oct 2024 08:52:01 GMT, Jeremie Miserez <[email protected]> wrote:


>> Mapping ISO-8859-8-I charset to ISO-8859-8.
>> The following two aliases are added as part of this change:
>> **ISO-8859-8-I**
>> **ISO8859-8-I**
>> 
>> Bug report: https://bugs.openjdk.org/browse/JDK-8195686
>
> One more thing: I forgot to explain why the alias ISO-8859-8-i -> ISO-8859-8 
> would definitely be correct.
> 
> Java strings are stored in logical order. That is true for both LTR and RTL 
> languages. This is plainly apparent from the OpenJDK String source code, but 
> also explicitly mentioned/explained e.g. by official tutorials such as here: 
> https://docs.oracle.com/javase/tutorial/2d/text/textlayoutbidirectionaltext.html#ordering_text
> 
> ISO-8859-8-i texts are always sent in logical order (by definition). **So 
> decoding an ISO-8859-8-i text into a Java string using the ISO-8859-8 alias 
> will result in the correct order of characters in the Java string, i.e. 
> logical order, and thus is always 100% correct by definition.**
> 
> Technically speaking, and for completeness' sake, here is the full list of 
> cases for regular ISO-8859-8 today:
> 
> 1. ISO-8859-8 texts may contain either LTR language content, in which case 
> the text is correctly decoded to a Java string in logical order. -> OK
> 2. ISO-8859-8 texts may also contain RTL language content in logical order 
> (newer applications already do this), in which case the text is also 
> correctly decoded to a Java string in logical order. -> OK.
> 3. But: If an ISO-8859-8 text contains RTL language content in visual order 
> (old applications, historically the case), the text would be decoded to a 
> Java string in visual order. This is actually technically incorrect and may 
> be a source of bugs (e.g. concatenation won't work correctly). However, this 
> behavior cannot be changed in OpenJDK anymore as (old) applications may rely 
> on it.
> 
> So: Case 2 is what would happen if the alias were added. Now, as long as 
> nobody adds an "auto-reverse visual to logical order" heuristic for RTL 
> ISO-8859-8 text decoding in OpenJDK (which I'm fairly certain can't / mustn't 
> be done), using a simple alias ISO-8859-8-i -> ISO-8859-8 will thus always be 
> correct. The alias will result in case 2, i.e. texts will always be decoded 
> into the correct Java string in logical order.

@jmiserez wrote:

> But: If an ISO-8859-8 text contains RTL language content in visual order (old 
> applications, historically the case), the text would be decoded to a Java 
> string in visual order. This is actually technically incorrect and may be a 
> source of bugs (e.g. concatenation won't work correctly). However, this 
> behavior cannot be changed in OpenJDK anymore as (old) applications may rely 
> on it.

In other words, Java _may_ have been incorrectly handling `ISO-8859-8` all this 
time if content was in visual order. Putting in this alias means that 
ISO-8859-8-I will be handled correctly.
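
To make the cases concrete, here is a minimal sketch (not part of the PR). It assumes the runtime ships the `ISO-8859-8` charset, as standard JDK builds do. The class `Iso88598Demo` and its helpers `resolve` and `decode` are hypothetical names used only for illustration. `resolve` maps the two `-I` labels to `ISO-8859-8` by hand, which is roughly what the alias would do inside the JDK. The demo decodes the Hebrew word "shalom" from bytes in logical order and from bytes in visual order; in both cases the decoder preserves the byte order and does no reordering.

```java
import java.nio.charset.Charset;
import java.util.Locale;

public class Iso88598Demo {
    // Hypothetical helper: always maps the "-I" labels to ISO-8859-8 by hand,
    // so it behaves the same whether or not the runtime knows the alias.
    static Charset resolve(String label) {
        String upper = label.toUpperCase(Locale.ROOT);
        if (upper.equals("ISO-8859-8-I") || upper.equals("ISO8859-8-I")) {
            return Charset.forName("ISO-8859-8");
        }
        return Charset.forName(label);
    }

    // Decodes the given byte values (passed as ints for readability).
    static String decode(String label, int... bytes) {
        byte[] raw = new byte[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            raw[i] = (byte) bytes[i];
        }
        return new String(raw, resolve(label));
    }

    public static void main(String[] args) {
        // "shalom" in logical order: shin, lamed, vav, final mem.
        String logical = decode("ISO-8859-8-I", 0xF9, 0xEC, 0xE5, 0xED);
        // The same word as a legacy visual-order sender would emit it.
        String visual = decode("ISO-8859-8", 0xED, 0xE5, 0xEC, 0xF9);

        // Logical-order input yields a logical-order Java string (case 2).
        System.out.println(logical.equals("\u05E9\u05DC\u05D5\u05DD"));
        // Visual-order input stays in visual order, i.e. reversed (case 3).
        System.out.println(visual.equals("\u05DD\u05D5\u05DC\u05E9"));
    }
}
```

Both comparisons print `true`: the byte-to-character mapping is identical either way, and only the sender's ordering convention differs.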

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20690#issuecomment-2403364716
