On Mon, 25 Sep 2023 12:28:40 GMT, Chen Liang <li...@openjdk.org> wrote:

> In the constructor of String, many locations the user-supplied byte or char 
> arrays are read multiple times with a plain memory access; if a user 
> previously wrote to one of such locations out of happens-before order, 
> distinct plain memory reads may result in different unanticipated values.
> 
> The main problem caused by such error is that String constructor may 
> incorrectly produce a UTF16 coder string with all-LATIN1 compatible 
> characters when `COMPACT_STRING` is true, which breaks the contract of 
> String. (The error can happen the other way around, but the resulting LATIN1 
> string is valid; this patch does not address that.)
> 
> Thus, I modified the String data compression for non-trusted arrays: a LATIN1 
> compression first-pass is still done, but if the first compression fails, a 
> second compression pass is done on a trusted (that is, copied from the 
> original data) data where reading would be consistent. The approach takes a 
> toll on UTF16 string construction time, but should not be more costly 
> memory-wise.
> 
> A separate routine to decode UTF8 in String constructor that takes byte 
> encoding has the same multi-read problem, that the old `offset--` leads to a 
> problematic double read. This is resolved by copying the data to decode to a 
> local array at first instead of reading from the user-provided byte array. 
> This fix also costs more runtime but at no extra memory cost.
> 
> Internal APIs such as newStringNoRepl are not guarded by this patch, as they 
> are already trusted to be immutable and unshared.
> 
> `test/jdk/java/lang/String` tests pass. More testing is needed to see if 
> there are other edge cases not covered.
> 
> Please review and don't hesitate to critique my approach and patch.

A side comment: I don't think it's problematic if we incorrectly create a 
LATIN1 String (such as by downcasting a char to byte), for such a String is 
valid and it's user's fault (for not publishing their writes to the array in 
happens-before order). We only think about avoiding writing a values array with 
no significant byte: we can just write any non-trivial 2-byte into UTF16 to 
make it valid, and that's why I'm looking for compress to return a `twoByte << 
32 | consumedLength`. Such a task of writing a 2-byte should be easy to 
accomplish.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/15902#issuecomment-1733990052

Reply via email to