On Mon, 25 Sep 2023 12:28:40 GMT, Chen Liang <li...@openjdk.org> wrote:

> In the constructor of String, many locations the user-supplied byte or char 
> arrays are read multiple times with a plain memory access; if a user 
> previously wrote to one of such locations out of happens-before order, 
> distinct plain memory reads may result in different unanticipated values.
> 
> The main problem caused by such error is that String constructor may 
> incorrectly produce a UTF16 coder string with all-LATIN1 compatible 
> characters when `COMPACT_STRING` is true, which breaks the contract of 
> String. (The error can happen the other way around, but the resulting LATIN1 
> string is valid; this patch does not address that.)
> 
> Thus, I modified the String data compression for non-trusted arrays: a LATIN1 
> compression first-pass is still done, but if the first compression fails, a 
> second compression pass is done on a trusted (that is, copied from the 
> original data) data where reading would be consistent. The approach takes a 
> toll on UTF16 string construction time, but should not be more costly 
> memory-wise.
> 
> A separate routine to decode UTF8 in String constructor that takes byte 
> encoding has the same multi-read problem, that the old `offset--` leads to a 
> problematic double read. This is resolved by copying the data to decode to a 
> local array at first instead of reading from the user-provided byte array. 
> This fix also costs more runtime but at no extra memory cost.
> 
> Internal APIs such as newStringNoRepl are not guarded by this patch, as they 
> are already trusted to be immutable and unshared.
> 
> `test/jdk/java/lang/String` tests pass. More testing is needed to see if 
> there are other edge cases not covered.
> 
> Please review and don't hesitate to critique my approach and patch.

Closing in favor of #16425.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/15902#issuecomment-1793259842

Reply via email to