Re: String encodeUTF8 latin1 with negatives

Brett Okken Mon, 28 Jul 2025 14:59:50 -0700

Roger,

For a String, the byte[] val is immutable, right?
And even the current behavior of checking for negatives and then cloning
would not be safe in the face of a concurrent modification, right?
Is there something else going on here which I am missing?


Thanks,
Brett

On Mon, Jul 28, 2025 at 3:19 PM Roger Riggs <[email protected]> wrote:

> Hi Brett,
>
> Extra care is needed if the input array might be modified concurrently
> with the method execution.
> When control flow decisions are made based on array contents, the
> integrity of the result depends on reading each byte of the array exactly
> once.
>
> Regards, Roger
>
>
>
> On 7/27/25 4:45 PM, Brett Okken wrote:
>
> In String.encodeUTF8, when the coder is latin1, there is a call to
> StringCoding.hasNegatives to determine if any special handling is needed.
> If not, a clone of the val is returned.
> If there are negative values, it then loops, from the beginning, through
> all the values to handle any individual negative values.
>
> Would it be better to call StringCoding.countPositives? If the result
> equals the length, the clone can still be returned. But if it does not, all
> the values which are positive can be simply copied to the target byte[] and
> only values beyond that point need to be checked again.
>
>
> https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/String.java#L1287-L1300
>
>         if (!StringCoding.hasNegatives(val, 0, val.length)) {
>             return val.clone();
>         }
>
>         int dp = 0;
>         byte[] dst = StringUTF16.newBytesFor(val.length);
>         for (byte c : val) {
>             if (c < 0) {
>                 dst[dp++] = (byte) (0xc0 | ((c & 0xff) >> 6));
>                 dst[dp++] = (byte) (0x80 | (c & 0x3f));
>             } else {
>                 dst[dp++] = c;
>             }
>         }
>
>
> Can be changed to look like:
>
>         int positives = StringCoding.countPositives(val, 0, val.length);
>         if (positives == val.length) {
>             return val.clone();
>         }
>
>         int dp = positives;
>         byte[] dst = StringUTF16.newBytesFor(val.length);
>         if (positives > 0) {
>             System.arraycopy(val, 0, dst, 0, positives);
>         }
>         for (int i=dp; i<val.length; ++i) {
>             byte c = val[i];
>             if (c < 0) {
>                 dst[dp++] = (byte) (0xc0 | ((c & 0xff) >> 6));
>                 dst[dp++] = (byte) (0x80 | (c & 0x3f));
>             } else {
>                 dst[dp++] = c;
>             }
>         }
>
>
>
> I have done a bit of testing with the StringEncode jmh benchmark on my
> local windows device.
>
> encodeLatin1LongEnd speeds up significantly (~70%)
> encodeLatin1LongStart slows down (~20%)
> encodeLatin1Mixed speeds up by ~30%
>
> The remaining tests do not show much difference either way.
>
> Brett
>
>
>
>
>
>

Re: String encodeUTF8 latin1 with negatives

Reply via email to