Re: String encodeUTF8 latin1 with negatives

Chen Liang Mon, 28 Jul 2025 15:26:13 -0700

Hi Brett, you are right.

In the scenario Roger described, the current code would already be unsafe 
because the hasNegatives check and the clone copy are two distinct reads to 
each array element, between which a mutation can happen.


Roger is mainly concerned because byte[] with LATIN1 or UTF16 data exists not 
only in String, but also in StringBuilder, which has the problems Roger 
described.

Currently, these methods in the call hierarchy do not make it clear that they 
are not tolerant of mutable arrays.

It may be reasonable to mark the encoding methods as so with comments, or more 
simply, by converting them to private instance methods that always operate on 
String instances, making sure no invalid combinations of value-coder can ever 
pass through.

Chen
________________________________
From: core-libs-dev <[email protected]> on behalf of Brett Okken 
<[email protected]>
Sent: Monday, July 28, 2025 4:59 PM
To: Roger Riggs <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: String encodeUTF8 latin1 with negatives

Roger,

For a String, the byte[] val is immutable, right?
And even the current behavior of checking for negatives and then cloning would 
not be safe in the face of a concurrent modification, right?
Is there something else going on here which I am missing?

Thanks,
Brett

On Mon, Jul 28, 2025 at 3:19 PM Roger Riggs 
<[email protected]<mailto:[email protected]>> wrote:
Hi Brett,

Extra care is needed if the input array might be modified concurrently with the 
method execution.
When control flow decisions are made based on array contents, the integrity of 
the result depends on reading each byte of the array exactly once.

Regards, Roger



On 7/27/25 4:45 PM, Brett Okken wrote:
In String.encodeUTF8, when the coder is latin1, there is a call to 
StringCoding.hasNegatives to determine if any special handling is needed. If 
not, a clone of the val is returned.
If there are negative values, it then loops, from the beginning, through all 
the values to handle any individual negative values.

Would it be better to call StringCoding.countPositives? If the result equals 
the length, the clone can still be returned. But if it does not, all the values 
which are positive can be simply copied to the target byte[] and only values 
beyond that point need to be checked again.

https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/String.java#L1287-L1300

        if (!StringCoding.hasNegatives(val, 0, val.length)) {
            return val.clone();
        }

        int dp = 0;
        byte[] dst = StringUTF16.newBytesFor(val.length);
        for (byte c : val) {
            if (c < 0) {
                dst[dp++] = (byte) (0xc0 | ((c & 0xff) >> 6));
                dst[dp++] = (byte) (0x80 | (c & 0x3f));
            } else {
                dst[dp++] = c;
            }
        }


Can be changed to look like:

        int positives = StringCoding.countPositives(val, 0, val.length);
        if (positives == val.length) {
            return val.clone();
        }

        int dp = positives;
        byte[] dst = StringUTF16.newBytesFor(val.length);
        if (positives > 0) {
            System.arraycopy(val, 0, dst, 0, positives);
        }
        for (int i=dp; i<val.length; ++i) {
            byte c = val[i];
            if (c < 0) {
                dst[dp++] = (byte) (0xc0 | ((c & 0xff) >> 6));
                dst[dp++] = (byte) (0x80 | (c & 0x3f));
            } else {
                dst[dp++] = c;
            }
        }



I have done a bit of testing with the StringEncode jmh benchmark on my local 
windows device.

encodeLatin1LongEnd speeds up significantly (~70%)
encodeLatin1LongStart slows down (~20%)
encodeLatin1Mixed speeds up by ~30%

The remaining tests do not show much difference either way.

Brett

Re: String encodeUTF8 latin1 with negatives

Reply via email to