On 07/11/2017 08:46 PM, Thomas Huth wrote:
+ s1 = cpu_ldub_data_ra(env, addr + 1, ra); + s2 = cpu_ldub_data_ra(env, addr + 2, ra); + c = s0 & 0x0f; + c = (c << 6) | (s1 & 0x3f); + c = (c << 6) | (s2 & 0x3f); + /* Fold the byte-by-byte range descriptions in the PoO into + tests against the complete value. It disallows encodings + that could be smaller, and the UTF-16 surrogates. */ + if (enh_check + && ((s1 & 0xc0) != 0x80 + || (s2 & 0xc0) != 0x80 + || c < 0x1000 + || (c >= 0xd800 && c <= 0xdfff))) { + return 2; + } + } else if (s0 <= (enh_check ? 0xf4 : 0xf7)) { + /* four byte character */ + l = 4; + if (ilen < 4) { + return 0; + } + s1 = cpu_ldub_data_ra(env, addr + 1, ra); + s2 = cpu_ldub_data_ra(env, addr + 2, ra); + s3 = cpu_ldub_data_ra(env, addr + 3, ra); + c = s0 & 0x0f;I think you could also use 0x07 instead of 0x0f here. Shouldn't matter much due to the s0 <= 0xf7 check above, though.
You're right that it doesn't matter due to the check, but fixed anyway. It makes the pattern wrt 2, 3, and 4 byte encodings more apparent.
r~
