On Mon, 25 Sep 2023 12:28:40 GMT, Chen Liang <li...@openjdk.org> wrote:
> In the constructor of String, many locations the user-supplied byte or char > arrays are read multiple times with a plain memory access; if a user > previously wrote to one of such locations out of happens-before order, > distinct plain memory reads may result in different unanticipated values. > > The main problem caused by such error is that String constructor may > incorrectly produce a UTF16 coder string with all-LATIN1 compatible > characters when `COMPACT_STRING` is true, which breaks the contract of > String. (The error can happen the other way around, but the resulting LATIN1 > string is valid; this patch does not address that.) > > Thus, I modified the String data compression for non-trusted arrays: a LATIN1 > compression first-pass is still done, but if the first compression fails, a > second compression pass is done on a trusted (that is, copied from the > original data) data where reading would be consistent. The approach takes a > toll on UTF16 string construction time, but should not be more costly > memory-wise. > > A separate routine to decode UTF8 in String constructor that takes byte > encoding has the same multi-read problem, that the old `offset--` leads to a > problematic double read. This is resolved by copying the data to decode to a > local array at first instead of reading from the user-provided byte array. > This fix also costs more runtime but at no extra memory cost. > > Internal APIs such as newStringNoRepl are not guarded by this patch, as they > are already trusted to be immutable and unshared. > > `test/jdk/java/lang/String` tests pass. More testing is needed to see if > there are other edge cases not covered. > > Please review and don't hesitate to critique my approach and patch. Closing in favor of #16425. ------------- PR Comment: https://git.openjdk.org/jdk/pull/15902#issuecomment-1793259842