Insufficiencies in JEP: 400: UTF-8 by Default

Marco Sun, 14 Mar 2021 08:02:04 -0700

Hi all,

The JEP generally paints the picture that using the OS charset would be 
incorrect or useless. It is, however, the right and perfectly valid choice for 
communicating with other local programs where no other charset was specified. 
The OS charset is the same as UTF-8 most of the time, but not always, and 
especially not on Windows, so using UTF-8 every time would be strictly less 
correct.


Per [1], LC_CTYPE defines the charset to use for transforming between binary 
data and text. Given that the file.encoding system property is not part of the 
Java SE specification, LC_CTYPE combined with the current specification of 
Charset.defaultCharset() is the only compliant way to change the default 
charset in Java SE, short of custom application-specific handling. Ignoring 
LC_CTYPE therefore leaves no standard approach. From the program's point of 
view the same applies in reverse: currently one can only use 
Charset.defaultCharset() to determine the OS charset, or let the java.io 
classes infer it through the charset-less constructors and then potentially 
read it back through e.g. InputStreamReader.getEncoding().
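
To illustrate the two routes just mentioned, here is a minimal sketch. The 
class and method names (DefaultCharsetProbe, viaDefaultCharset, viaReader) are 
made up for this example; only Charset.defaultCharset(), Charset.forName() and 
InputStreamReader.getEncoding() are real API. Note that both routes report 
whatever the default charset currently is, so after JEP 400 they would no 
longer reveal the OS charset.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class DefaultCharsetProbe {
    // Direct route: ask for the default charset.
    static Charset viaDefaultCharset() {
        return Charset.defaultCharset();
    }

    // Indirect route: let a charset-less constructor choose the charset,
    // then read it back. getEncoding() may return a historical name
    // (e.g. "UTF8"), so it is resolved through Charset.forName().
    static Charset viaReader() {
        InputStreamReader reader =
                new InputStreamReader(new ByteArrayInputStream(new byte[0]));
        return Charset.forName(reader.getEncoding());
    }

    public static void main(String[] args) {
        System.out.println("defaultCharset(): " + viaDefaultCharset());
        System.out.println("via reader:       " + viaReader());
    }
}
```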

The OS charset is still relevant for text interaction on System.in/out/err, 
sub-process stdin/stdout/stderr and files with unknown encoding. Programs like 
grep assume the files are encoded according to LC_CTYPE, much like a similarly 
designed Java program that uses the OS charset on purpose. Constructing a 
Reader for stdin properly requires some way to determine the relevant OS 
encoding.
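
As a rough sketch of why the charset matters here: the helper below 
(LocalTextInput.firstLine, a hypothetical name) decodes a byte stream with an 
explicitly supplied charset. ISO-8859-1 merely stands in for a non-UTF-8 OS 
charset, and a byte array stands in for stdin; in a real program one would 
pass System.in together with the OS charset, which is exactly the value that 
needs a standard way to be obtained.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LocalTextInput {
    // Decode the first line of a byte stream using the given charset.
    static String firstLine(InputStream in, Charset cs) {
        try {
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(in, cs));
            String line = reader.readLine();
            return line == null ? "" : line;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "café" followed by a newline, encoded as ISO-8859-1.
        byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9, '\n'};

        // Decoded with the charset the sender actually used.
        String right = firstLine(new ByteArrayInputStream(latin1),
                StandardCharsets.ISO_8859_1);
        // Decoded as UTF-8: 0xE9 is malformed here and gets replaced.
        String wrong = firstLine(new ByteArrayInputStream(latin1),
                StandardCharsets.UTF_8);

        System.out.println(right.equals("caf\u00e9"));
        System.out.println(wrong.equals("caf\u00e9"));
    }
}
```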

I'm perfectly happy with changing the charset-less methods to use UTF-8, since 
it's the best choice outside the above scenarios, despite the compatibility 
impact. Dropping standardized support for the OS charset, however, not only 
breaks the above interactions but also leaves no nice migration path. The 
-Dfile.encoding=COMPAT workaround is explicitly not standardized and isn't 
available to the Java application itself, only to whoever starts the JVM, 
presumably to work around outdated code.

IMO Charset should provide standardized getters for the OS charset and the 
console charset. The latter being different has been a long-standing issue on 
Windows, where the code page differs between its CLI and regular environments. 
OpenJDK already has the necessary data available in its custom system 
properties.
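
A sketch of what such getters could look like when built on today's 
implementation-specific properties. The property names sun.jnu.encoding and 
sun.stdout.encoding are my assumption about which OpenJDK properties carry 
this data; they are unspecified, may be absent, and may change. The class and 
method names (OsCharsets, osCharset, consoleCharset, fromProperty) are 
hypothetical and not a proposed API.

```java
import java.nio.charset.Charset;

public class OsCharsets {
    // Resolve a charset named by a system property, or use the fallback
    // if the property is missing, empty, or names an unusable charset.
    static Charset fromProperty(String key, Charset fallback) {
        String name = System.getProperty(key);
        if (name == null || name.isEmpty()) {
            return fallback;
        }
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException and
            // UnsupportedCharsetException, which both extend it.
            return fallback;
        }
    }

    // Assumption: sun.jnu.encoding reflects the OS charset on OpenJDK.
    static Charset osCharset() {
        return fromProperty("sun.jnu.encoding", Charset.defaultCharset());
    }

    // Assumption: sun.stdout.encoding is set when stdout is a console.
    static Charset consoleCharset() {
        return fromProperty("sun.stdout.encoding", osCharset());
    }

    public static void main(String[] args) {
        System.out.println("OS charset:      " + osCharset());
        System.out.println("Console charset: " + consoleCharset());
    }
}
```

Having to fall back like this is the very problem: an application has no 
portable way to know whether the values it gets are the real ones.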

The console charset is currently hidden because PrintStream neither exposes 
its underlying OutputStreamWriter nor offers getEncoding() itself. The OS 
charset would be hidden in the future by the change to the specification of 
Charset.defaultCharset() in JEP 400.

Please consider the above minor additions to fix those issues for good.

Best regards,

Marco

[1] https://pubs.opengroup.org/onlinepubs/7908799/xbd/envvar.html
