Re: RFR: 8321373: Build should use LC_ALL=C.UTF-8 [v2]

Magnus Ihse Bursie Thu, 01 Feb 2024 06:55:14 -0800

On Thu, 1 Feb 2024 13:53:56 GMT, Magnus Ihse Bursie <i...@openjdk.org> wrote:


>> We're currently setting LC_ALL=C. Not all tools will default to UTF-8 as 
>> their encoding of choice when they see this locale, but may use an arbitrary 
>> encoding, which might not properly handle all UTF-8 characters. Since in 
>> practice, all our encoding is UTF-8, we should tell our tools this as well.
>> 
>> This will at least have an effect on how Java treats path names that include 
>> Unicode characters.
>
> Magnus Ihse Bursie has updated the pull request with a new target base due to 
> a merge or a rebase. The incremental webrev excludes the unrelated changes 
> brought in by the merge/rebase. The pull request contains three additional 
> commits since the last revision:
> 
>  - Explicitly load StandardCharsets ascii/utf-8 in HelloClasslist
>  - Merge branch 'master' into c.utf-8
>  - 8321373: Build should use LC_ALL=C.UTF-8

So on Linux, with this patch, the generated class list will no longer include 
sun/nio/cs/StandardCharsets$Aliases, sun/nio/cs/StandardCharsets$Cache or 
sun/util/PreHashedMap. Even prior to this PR, they were not included on macOS. 

My understanding is that these are only used when selecting a character 
encoding other than US ASCII, UTF-8, or Latin-1. See this snippet from the 
generated StandardCharsets.java:


    private Charset lookup(String charsetName) {
        // By checking these built-ins we can avoid initializing Aliases,
        // Classes and Cache eagerly during bootstrap.
        //
        // Initialization of java.nio.charset.StandardCharsets should be
        // avoided here to minimize time spent in System.initPhase1, as it
        // may delay initialization of performance critical VM subsystems.
        String csn;
        if (charsetName.equals("UTF-8")) {
            return UTF_8.INSTANCE;
        } else if (charsetName.equals("US-ASCII")) {
            return US_ASCII.INSTANCE;
        } else if (charsetName.equals("ISO-8859-1")) {
            return ISO_8859_1.INSTANCE;
        } else {
            csn = canonicalize(toLower(charsetName));
        }


So my guess is that this is not necessarily a bad thing; they will need to be 
loaded if we want to look up more esoteric encodings, but that is perhaps not a 
common use case in these days of UTF-8's global triumph.
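
To make the distinction concrete, here is a minimal sketch (not part of the PR) 
that mirrors the exact-match checks in the snippet above. The class name 
CharsetLookupSketch and the helper hitsBuiltinCheck are hypothetical, made up 
for illustration; only Charset.forName is real JDK API. The assumption is that 
a name such as "latin1", a registered alias of ISO-8859-1, misses the three 
string comparisons and therefore goes through canonicalize():

```java
import java.nio.charset.Charset;

class CharsetLookupSketch {
    // Hypothetical helper, not a JDK API: mirrors the three exact-match
    // checks from the generated StandardCharsets.lookup() quoted above.
    static boolean hitsBuiltinCheck(String charsetName) {
        return charsetName.equals("UTF-8")
            || charsetName.equals("US-ASCII")
            || charsetName.equals("ISO-8859-1");
    }

    public static void main(String[] args) {
        // "latin1" resolves to ISO-8859-1, but only via the alias lookup.
        String[] names = {"UTF-8", "US-ASCII", "ISO-8859-1", "latin1"};
        for (String name : names) {
            Charset cs = Charset.forName(name);
            String path = hitsBuiltinCheck(name) ? "built-in check" : "alias lookup";
            System.out.println(name + " -> " + cs.name() + " (" + path + ")");
        }
    }
}
```

Running something like this with -Xlog:class+load should show whether the 
Aliases and Cache classes get loaded, though the exact output will depend on 
the JDK version and on what else has already triggered a charset lookup.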

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16971#issuecomment-1921511589
