Chen, Thanks for the clarification. The StringCoding methods are a bit light on documentation, but presumably because of the package scope.
The specific method I am proposing a change to is a private method in String. Thanks, Brett On Mon, Jul 28, 2025 at 5:26 PM Chen Liang <chen.l.li...@oracle.com> wrote: > Hi Brett, you are right. > > In the scenario Roger described, the current code would already be unsafe > because the hasNegatives check and the clone copy are two distinct reads to > each array element, between which a mutation can happen. > > Roger is mainly concerned because byte[] with LATIN1 or UTF16 data exists > not only in String, but also in StringBuilder, which has the problems Roger > described. > > Currently, these methods in the call hierarchy do not make it clear that > they are not tolerant of mutable arrays. > > It may be reasonable to mark the encoding methods as so with comments, or > more simply, by converting them to private instance methods that always > operate on String instances, making sure no invalid combinations of > value-coder can ever pass through. > > Chen > ------------------------------ > *From:* core-libs-dev <core-libs-dev-r...@openjdk.org> on behalf of Brett > Okken <brett.okken...@gmail.com> > *Sent:* Monday, July 28, 2025 4:59 PM > *To:* Roger Riggs <roger.ri...@oracle.com> > *Cc:* core-libs-dev@openjdk.org <core-libs-dev@openjdk.org> > *Subject:* Re: String encodeUTF8 latin1 with negatives > > Roger, > > For a String, the byte[] val is immutable, right? > And even the current behavior of checking for negatives and then cloning > would not be safe in the face of a concurrent modification, right? > Is there something else going on here which I am missing? > > Thanks, > Brett > > On Mon, Jul 28, 2025 at 3:19 PM Roger Riggs <roger.ri...@oracle.com> > wrote: > > Hi Brett, > > Extra care is needed if the input array might be modified concurrently > with the method execution. > When control flow decisions are made based on array contents, the > integrity of the result depends on reading each byte of the array exactly > once. > > Regards, Roger > > > > On 7/27/25 4:45 PM, Brett Okken wrote: > > In String.encodeUTF8, when the coder is latin1, there is a call to > StringCoding.hasNegatives to determine if any special handling is needed. > If not, a clone of the val is returned. > If there are negative values, it then loops, from the beginning, through > all the values to handle any individual negative values. > > Would it be better to call StringCoding.countPositives? If the result > equals the length, the clone can still be returned. But if it does not, all > the values which are positive can be simply copied to the target byte[] and > only values beyond that point need to be checked again. > > > https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/String.java#L1287-L1300 > > if (!StringCoding.hasNegatives(val, 0, val.length)) { > return val.clone(); > } > > int dp = 0; > byte[] dst = StringUTF16.newBytesFor(val.length); > for (byte c : val) { > if (c < 0) { > dst[dp++] = (byte) (0xc0 | ((c & 0xff) >> 6)); > dst[dp++] = (byte) (0x80 | (c & 0x3f)); > } else { > dst[dp++] = c; > } > } > > > Can be changed to look like: > > int positives = StringCoding.countPositives(val, 0, val.length); > if (positives == val.length) { > return val.clone(); > } > > int dp = positives; > byte[] dst = StringUTF16.newBytesFor(val.length); > if (positives > 0) { > System.arraycopy(val, 0, dst, 0, positives); > } > for (int i=dp; i<val.length; ++i) { > byte c = val[i]; > if (c < 0) { > dst[dp++] = (byte) (0xc0 | ((c & 0xff) >> 6)); > dst[dp++] = (byte) (0x80 | (c & 0x3f)); > } else { > dst[dp++] = c; > } > } > > > > I have done a bit of testing with the StringEncode jmh benchmark on my > local windows device. > > encodeLatin1LongEnd speeds up significantly (~70%) > encodeLatin1LongStart slows down (~20%) > encodeLatin1Mixed speeds up by ~30% > > The remaining tests do not show much difference either way. > > Brett > > > > > >