RE: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.

Uwe Schindler Wed, 21 Feb 2018 00:56:04 -0800

Hi,

> This draft JEP contains a proposal to use UTF-8 as the default charset
> for the JVM, so that
> APIs that depend on the default charset behave consistently across all
> platforms.
> 
> For more details, please see:
> https://bugs.openjdk.java.net/browse/JDK-8187041


Thanks for finally adding a JEP like this. Thanks also to Robert Muir for 
always insisting on fixing this problem! I have a few comments:

The JEP should NOT lead to new APIs that convert between characters and bytes 
no longer accepting an explicit charset. One example is the proposed 
ByteBuffer methods taking String. The default ones would work with UTF-8, but 
it should still be possible for an API user to pass a charset wherever 
there is a conversion between bytes and chars. This is especially important 
because the user may still change the default and thereby break your app. The 
rule is still: Only YOU, the developer, know the charset of your stuff when you 
load a JAR resource file or pass a String to the network in a ByteBuffer!
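
A minimal sketch of what "always explicit" means for a String that goes to the 
network as a ByteBuffer. The class and method names are made up for 
illustration; only Charset.encode/decode and StandardCharsets are real JDK API:

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK or the JEP.
final class ExplicitCharsetExample {

    // The caller names the charset; the JVM default is never consulted.
    static ByteBuffer encode(String text, Charset charset) {
        return charset.encode(text);
    }

    static String decode(ByteBuffer bytes, Charset charset) {
        return charset.decode(bytes).toString();
    }

    public static void main(String[] args) {
        String text = "Gr\u00fc\u00dfe"; // contains two non-ASCII characters
        ByteBuffer utf8 = encode(text, StandardCharsets.UTF_8);
        ByteBuffer latin1 = encode(text, StandardCharsets.ISO_8859_1);
        // Same String, different byte counts depending on the charset.
        System.out.println(utf8.remaining() + " vs " + latin1.remaining());
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(text));
    }
}
```

The same text yields different byte sequences per charset, which is why the 
protocol (and not the machine the code happens to run on) has to decide.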

The biggest offenders here are also given as an example: FileReader and 
FileWriter. Although both classes subclass InputStreamReader/OutputStreamWriter 
and just pass the right delegate to the superclass in the ctor, both classes 
lack any way to specify a charset. Because of this, the use of 
FileReader and FileWriter is completely forbidden in many Apache projects 
(Apache Lucene, Solr, Elasticsearch, Apache Tika, ...). So I'd suggest also 
fixing the API here and just adding the missing ctors.
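
Until such ctors exist, the usual workaround is to build the delegate chain 
yourself. A sketch, with helper names (openReader, openWriter, roundTrip) that 
are only illustrative:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical replacement for FileReader/FileWriter with an explicit charset.
final class CharsetAwareFileIo {

    static Reader openReader(File file, Charset charset) throws IOException {
        return new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charset));
    }

    static Writer openWriter(File file, Charset charset) throws IOException {
        return new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), charset));
    }

    // Writes text to a temp file and reads it back with the same charset.
    static String roundTrip(String text, Charset charset) {
        try {
            File tmp = File.createTempFile("charset-io", ".txt");
            try {
                try (Writer w = openWriter(tmp, charset)) {
                    w.write(text);
                }
                StringBuilder sb = new StringBuilder();
                try (Reader r = openReader(tmp, charset)) {
                    int c;
                    while ((c = r.read()) != -1) {
                        sb.append((char) c);
                    }
                }
                return sb.toString();
            } finally {
                tmp.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("\u00e4\u00f6\u00fc", StandardCharsets.UTF_8));
    }
}
```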

The Java 7+ methods in java.nio.file.Files already ignore the default charset 
and always use UTF-8. How to proceed with those? Should they be changed to 
follow the new mechanism? I'd suggest not doing this, as it's part of the 
spec (to use UTF-8) and should not rely on external forces, but I wanted to 
bring this up.
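
To illustrate the behavior I mean: Files.readAllLines(Path) (the overload 
without a charset argument) decodes as UTF-8 whatever file.encoding says. A 
small sketch, with a made-up class and method name:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical demo class; only the java.nio.file.Files calls are JDK API.
final class FilesUtf8Example {

    // Writes one line as UTF-8 bytes, then reads it back without naming a charset.
    static List<String> writeUtf8AndReadBack(String line) {
        try {
            Path tmp = Files.createTempFile("files-utf8", ".txt");
            try {
                Files.write(tmp, line.getBytes(StandardCharsets.UTF_8));
                // No charset argument: specified to decode as UTF-8.
                return Files.readAllLines(tmp);
            } finally {
                Files.delete(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeUtf8AndReadBack("Gr\u00fc\u00dfe"));
    }
}
```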

Changing the default would help many users, provided they are actually using 
newer JDKs. Those on older versions (or compiling their code against older 
versions) still have to avoid the default charset. In addition, as you can 
still change the "default charset", any library developer reading resources 
from its own JAR file or passing Strings to network protocols cannot rely on 
the default charset really being UTF-8 (a user may have changed it to 
something else). Because of this, Apache libraries will forbid usage of all 
methods using default charsets (and locales + timezones). The "changeable 
default" does not affect application developers (because in most cases they 
have control over the environment), but library developers should always be 
explicit!
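
For the JAR resource case, a library would decode its own bundled files with 
the charset it shipped them in. A sketch, assuming the resource was saved as 
UTF-8; the class name and the resource name passed to loadBundled are 
hypothetical:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical library helper, for illustration only.
final class ResourceText {

    // Decodes the whole stream with the given charset and closes it.
    static String readAll(InputStream in, Charset charset) {
        try (Reader reader = new InputStreamReader(in, charset)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Assumption: the library author saved the bundled resource as UTF-8.
    static String loadBundled(String resourceName) {
        InputStream in = ResourceText.class.getResourceAsStream(resourceName);
        if (in == null) {
            throw new IllegalArgumentException("missing resource: " + resourceName);
        }
        return readAll(in, StandardCharsets.UTF_8);
    }
}
```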

For this to work, I also want to do some "advertisement": All library projects 
should use the Forbidden-Apis Maven/Gradle/Ant plugin to scan their bytecode 
for offenders using default charsets, default locales or relying on default 
timezones. See the blog post about it [1] and the project page [2]. The tool is 
also useful to replace "jdeps" in projects with Java versions before 8, as it 
can scan your code for access to internal JDK APIs, too. See the documentation 
[3] and GitHub wiki pages for useful examples. It may also be a good idea to 
mention it in the JEP as a "workaround" or "further reading".

Finally: Because one can still change the default, I'd propose to deprecate all 
methods that use a default charset (independently of actually changing the 
default). Only that would make tools like "forbiddenapis" irrelevant for 
library developers.

And finally, finally: I'd also propose to change the default Locale to 
Locale.ROOT (same issues). String.toLowerCase() in Turkish locales still 
breaks thousands of apps! That's a different JEP, but I would strongly 
support it!
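
For anyone who has not run into it, here is a small sketch of the Turkish 
problem (the class and method names are mine, just for illustration):

```java
import java.util.Locale;

// Hypothetical demo class; String.toLowerCase(Locale) is the real JDK API.
final class TurkishLowerCase {

    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        // In Turkish, capital I lowercases to dotless i (U+0131).
        System.out.println(lower("TITLE", turkish).equals("t\u0131tle"));
        // Locale.ROOT gives the locale-neutral result.
        System.out.println(lower("TITLE", Locale.ROOT).equals("title"));
    }
}
```

Code that compares the result against an ASCII keyword such as "title" then 
silently fails on a machine with a Turkish default locale, unless it passes 
Locale.ROOT explicitly.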

Uwe

[1] http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html
[2] https://github.com/policeman-tools/forbidden-apis
[3] https://jenkins.thetaphi.de/job/Forbidden-APIs/javadoc/

-----
Uwe Schindler
uschind...@apache.org 
ASF Member, Apache Lucene PMC / Committer
Bremen, Germany
http://lucene.apache.org/
