Hi Mark,

Mark H Weaver <m...@netris.org> writes:
> l...@gnu.org (Ludovic Courtès) writes:
>> So, here’s the patch.
>>
>> It also makes UTF-8 input ~30% faster according to ports.bm (which
>> doesn’t benchmark output):
>
> Thanks for working on this.  I haven't yet had time to fully review
> this patch, but here I will document the problems I see so far.

Thanks for the review!

> First of all, while looking at this patch, I've discovered another
> problem in ports.c: scm_char_ready_p does not consider the possibility
> of multibyte characters, and returns #t whenever there is at least one
> byte ready.

Indeed; let’s discuss it separately.

> Unicode requires that we reject as ill-formed any UTF-8 byte sequence
> in non-shortest form.  For example, we must reject the byte sequence
> 0xC1 0x80, which a permissive reader would read as 0x40, since
> obviously that code point can be encoded as a single byte in UTF-8.
>
> We must also reject any UTF-8 byte sequence that corresponds to a
> surrogate code point (U+D800..U+DFFF), or to a code point greater than
> U+10FFFF.
>
> Table 3.7 of the Unicode 6.0.0 standard, reproduced below, concisely
> shows all well-formed UTF-8 byte sequences.  The asterisks highlight
> continuation bytes that are constrained to a smaller range than the
> usual 80..BF.
>
>   code points          byte[0]   byte[1]   byte[2]   byte[3]
>   ---------------------------------------------------------
>   U+000000..U+00007F | 00..7F  |         |         |         |
>   U+000080..U+0007FF | C2..DF  | 80..BF  |         |         |
>   U+000800..U+000FFF | E0      | A0..BF* | 80..BF  |         |
>   U+001000..U+00CFFF | E1..EC  | 80..BF  | 80..BF  |         |
>   U+00D000..U+00D7FF | ED      | 80..9F* | 80..BF  |         |
>   U+00E000..U+00FFFF | EE..EF  | 80..BF  | 80..BF  |         |
>   U+010000..U+03FFFF | F0      | 90..BF* | 80..BF  | 80..BF  |
>   U+040000..U+0FFFFF | F1..F3  | 80..BF  | 80..BF  | 80..BF  |
>   U+100000..U+10FFFF | F4      | 80..8F* | 80..BF  | 80..BF  |
>   ---------------------------------------------------------
>
> So, for the code above corresponding to 2-byte sequences, it would
> suffice to verify that buf[0] >= 0xC2.
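For concreteness, the table maps directly onto a per-leading-byte check.
Here is a sketch of how I read it (a standalone helper with a made-up
name, not code from the patch): it returns the expected sequence length
and the constrained range for the first continuation byte; subsequent
continuation bytes are always 80..BF.

```c
#include <stdint.h>

/* Hypothetical helper: classify a UTF-8 leading byte per Table 3.7.
   Return the total sequence length (1..4), or -1 if LEAD can never
   start a well-formed sequence.  On success, *LO and *HI receive the
   valid range for the first continuation byte.  */
static int
utf8_sequence_info (uint8_t lead, uint8_t *lo, uint8_t *hi)
{
  *lo = 0x80;
  *hi = 0xBF;

  if (lead <= 0x7F)
    return 1;                        /* ASCII */
  if (lead <= 0xC1)
    return -1;                       /* continuation byte, or overlong C0/C1 */
  if (lead <= 0xDF)
    return 2;
  if (lead == 0xE0)
    { *lo = 0xA0; return 3; }        /* A0..BF: reject overlong forms */
  if (lead <= 0xEC)
    return 3;
  if (lead == 0xED)
    { *hi = 0x9F; return 3; }        /* 80..9F: reject surrogates */
  if (lead <= 0xEF)
    return 3;
  if (lead == 0xF0)
    { *lo = 0x90; return 4; }        /* 90..BF: reject overlong forms */
  if (lead <= 0xF3)
    return 4;
  if (lead == 0xF4)
    { *hi = 0x8F; return 4; }        /* 80..8F: reject > U+10FFFF */
  return -1;                         /* F5..FF: always ill-formed */
}
```

With a helper like this, the "else" branches of the decoder only need to
range-check each continuation byte against *LO/*HI.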
> The 3- and 4-byte cases are somewhat more constrained.

Indeed, thanks for educating me.  ;-)

Could you add UTF-8 tests for such cases using the just-committed
‘test-decoding-error’ in ports.test?

>> +      else if ((buf[0] & 0xf0) == 0xe0)
>> +        {
>> +          byte = scm_get_byte_or_eof (port);
>> +          if (byte == EOF || ((byte & 0xc0) != 0x80))
>> +            goto invalid_seq;
>> +
>> +          buf[1] = (scm_t_uint8) byte;
>> +          *len = 2;
>> +
>> +          byte = scm_get_byte_or_eof (port);
>> +          if (byte == EOF || ((byte & 0xc0) != 0x80))
>> +            goto invalid_seq;
>> +
>> +          buf[2] = (scm_t_uint8) byte;
>> +          *len = 3;
>> +
>> +          *codepoint = ((scm_t_wchar) buf[0] & 0x0f) << 12UL
>> +            | ((scm_t_wchar) buf[1] & 0x3f) << 6UL
>> +            | (buf[2] & 0x3f);
>> +        }
>> +      else
>> +        {
>
> That ^^^ should not simply be an "else".  It must check that the first
> byte is valid.

Right.

>> +    invalid_seq:
>> +      /* Return the faulty byte.  */
>> +      scm_unget_byte (byte, port);
>
> This ungets only the last byte, but there may be up to 4 bytes to
> unget.

No, that’s done in ‘peek-char’.

>> +/* Read a codepoint from PORT and return it in *CODEPOINT.  Fill BUF
>> +   with the byte representation of the codepoint in PORT's encoding,
>> +   and set *LEN to the length in bytes of that representation.
>> +   Return 0 on success and an errno value on error.  */
>> +static int
>> +get_codepoint (SCM port, scm_t_wchar *codepoint,
>> +               char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
>> +{
>> +  int err;
>> +  scm_t_port *pt = SCM_PTAB_ENTRY (port);
>> +
>> +  if (pt->input_cd == (iconv_t) -1)
>> +    /* Initialize the conversion descriptors, if needed.  */
>> +    scm_i_set_port_encoding_x (port, pt->encoding);
>> +
>> +  if (pt->input_cd == (iconv_t) -1)
>> +    err = get_utf8_codepoint (port, codepoint, (scm_t_uint8 *) buf, len);
>> +  else
>> +    err = get_iconv_codepoint (port, codepoint, buf, len);
>
> From the code above, it appears that for UTF-8 ports,
> scm_i_set_port_encoding_x will necessarily be called once per
> character read.  This seems rather inefficient.
Correct.  Alas, I don’t know how to avoid this inefficiency in 2.0: we
can’t just add a flag to ‘scm_t_port’, because that would break the ABI.
Ideas?

Besides, however inefficient it may seem, it’s still more efficient
than what we currently have, as I explained.

> Also, if we wish to support Latin-1 without iconv as well, the simple
> method above will not work.

Why would we want such a thing?  :-)

The starting point for this patch was the observation that our Unicode
I/O converts to/from UTF-8, and then from UTF-8 to our internal
representation; it’s wasteful to use iconv to convert from UTF-8 to
UTF-8 when reading from or writing to a UTF-8 port.

> I would recommend adding an enum field to the port which for now only
> has two encoding schemes: ICONV or UTF8.  Later, we could add LATIN1
> and maybe ASCII as well.  Given that this check must be done once per
> character, it seems better to do a switch on an enum than to strcmp
> with pt->encoding (as is done in scm_i_set_port_encoding_x).

Agreed; maybe something for ‘master’ once this version is in 2.0?

Thanks,
Ludo’.
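P.S.  For the record, here is roughly how I picture the enum dispatch
you suggest (all names hypothetical; a sketch, not a patch): the
encoding name is classified once, when the port’s encoding is set, so
the per-character read path only switches on a small enum instead of
strcmp’ing pt->encoding.

```c
#include <string.h>

/* Hypothetical encoding schemes; LATIN1/ASCII could be added later.  */
enum port_encoding_scheme
  {
    PORT_ENCODING_ICONV,             /* generic path: go through iconv */
    PORT_ENCODING_UTF8               /* fast path: decode inline */
  };

/* Classify an encoding name once, at port-encoding-set time.  Anything
   that is not literally "UTF-8" takes the generic iconv path in this
   two-scheme starting point.  */
static enum port_encoding_scheme
classify_encoding (const char *encoding)
{
  if (encoding != NULL && strcmp (encoding, "UTF-8") == 0)
    return PORT_ENCODING_UTF8;
  return PORT_ENCODING_ICONV;
}
```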