Hi Mark,

Mark H Weaver <m...@netris.org> writes:
> l...@gnu.org (Ludovic Courtès) writes:
>> So, here’s the patch.
>>
>> It also makes UTF-8 input ~30% faster according to ports.bm (which
>> doesn’t benchmark output):
>
> Thanks for working on this.  I haven't yet had time to fully review
> this patch, but here I will document the problems I see so far.

Thanks for the review!

> First of all, while looking at this patch, I've discovered another
> problem in ports.c: scm_char_ready_p does not consider the possibility
> of multibyte characters, and returns #t whenever there is at least one
> byte ready.

Indeed; let’s discuss it separately.

> Unicode requires that we reject as ill-formed any UTF-8 byte sequence
> in non-shortest form.  For example, we must reject the byte sequence
> 0xC1 0x80, which a permissive reader would read as 0x40, since
> obviously that code point can be encoded as a single byte in UTF-8.
>
> We must also reject any UTF-8 byte sequence that corresponds to a
> surrogate code point (U+D800..U+DFFF), or to a code point greater than
> U+10FFFF.
>
> Table 3.7 of the Unicode 6.0.0 standard, reproduced below, concisely
> shows all well-formed UTF-8 byte sequences.  The asterisks highlight
> continuation bytes that are constrained to a smaller range than the
> usual 80..BF.
>
>   code points          byte[0]   byte[1]   byte[2]   byte[3]
>   ---------------------------------------------------------
>   U+000000..U+00007F | 00..7F  |         |         |         |
>   U+000080..U+0007FF | C2..DF  | 80..BF  |         |         |
>   U+000800..U+000FFF | E0      | A0..BF* | 80..BF  |         |
>   U+001000..U+00CFFF | E1..EC  | 80..BF  | 80..BF  |         |
>   U+00D000..U+00D7FF | ED      | 80..9F* | 80..BF  |         |
>   U+00E000..U+00FFFF | EE..EF  | 80..BF  | 80..BF  |         |
>   U+010000..U+03FFFF | F0      | 90..BF* | 80..BF  | 80..BF  |
>   U+040000..U+0FFFFF | F1..F3  | 80..BF  | 80..BF  | 80..BF  |
>   U+100000..U+10FFFF | F4      | 80..8F* | 80..BF  | 80..BF  |
>   ---------------------------------------------------------
>
> So, for the code above corresponding to 2-byte sequences, it would
> suffice to verify that buf[0] >= 0xC2.
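For concreteness, the table maps directly onto a per-leading-byte check.
Here is a sketch of how I read it (a standalone helper with a made-up
name, not code from the patch): it returns the expected sequence length
and the constrained range for the first continuation byte; subsequent
continuation bytes are always 80..BF.

```c
#include <stdint.h>

/* Hypothetical helper: classify a UTF-8 leading byte per Table 3.7.
   Return the total sequence length (1..4), or -1 if LEAD can never
   start a well-formed sequence.  On success, *LO and *HI receive the
   valid range for the first continuation byte.  */
static int
utf8_sequence_info (uint8_t lead, uint8_t *lo, uint8_t *hi)
{
  *lo = 0x80;
  *hi = 0xBF;

  if (lead <= 0x7F)
    return 1;                        /* ASCII */
  if (lead <= 0xC1)
    return -1;                       /* continuation byte, or overlong C0/C1 */
  if (lead <= 0xDF)
    return 2;
  if (lead == 0xE0)
    { *lo = 0xA0; return 3; }        /* A0..BF: reject overlong forms */
  if (lead <= 0xEC)
    return 3;
  if (lead == 0xED)
    { *hi = 0x9F; return 3; }        /* 80..9F: reject surrogates */
  if (lead <= 0xEF)
    return 3;
  if (lead == 0xF0)
    { *lo = 0x90; return 4; }        /* 90..BF: reject overlong forms */
  if (lead <= 0xF3)
    return 4;
  if (lead == 0xF4)
    { *hi = 0x8F; return 4; }        /* 80..8F: reject > U+10FFFF */
  return -1;                         /* F5..FF: always ill-formed */
}
```

With a helper like this, the "else" branches of the decoder only need to
range-check each continuation byte against *LO/*HI.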
> The 3- and 4-byte cases are somewhat more constrained.

Indeed, thanks for educating me.  ;-)

Could you add UTF-8 tests for such cases using the just-committed
‘test-decoding-error’ in ports.test?

>> +      else if ((buf[0] & 0xf0) == 0xe0)
>> +        {
>> +          byte = scm_get_byte_or_eof (port);
>> +          if (byte == EOF || ((byte & 0xc0) != 0x80))
>> +            goto invalid_seq;
>> +
>> +          buf[1] = (scm_t_uint8) byte;
>> +          *len = 2;
>> +
>> +          byte = scm_get_byte_or_eof (port);
>> +          if (byte == EOF || ((byte & 0xc0) != 0x80))
>> +            goto invalid_seq;
>> +
>> +          buf[2] = (scm_t_uint8) byte;
>> +          *len = 3;
>> +
>> +          *codepoint = ((scm_t_wchar) buf[0] & 0x0f) << 12UL
>> +            | ((scm_t_wchar) buf[1] & 0x3f) << 6UL
>> +            | (buf[2] & 0x3f);
>> +        }
>> +      else
>> +        {
>
> That ^^^ should not simply be an "else".  It must check that the first
> byte is valid.

Right.

>> +    invalid_seq:
>> +      /* Return the faulty byte.  */
>> +      scm_unget_byte (byte, port);
>
> This ungets only the last byte, but there may be up to 4 bytes to
> unget.

No, that’s done in ‘peek-char’.

>> +/* Read a codepoint from PORT and return it in *CODEPOINT.  Fill BUF
>> +   with the byte representation of the codepoint in PORT's encoding,
>> +   and set *LEN to the length in bytes of that representation.
>> +   Return 0 on success and an errno value on error.  */
>> +static int
>> +get_codepoint (SCM port, scm_t_wchar *codepoint,
>> +               char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
>> +{
>> +  int err;
>> +  scm_t_port *pt = SCM_PTAB_ENTRY (port);
>> +
>> +  if (pt->input_cd == (iconv_t) -1)
>> +    /* Initialize the conversion descriptors, if needed.  */
>> +    scm_i_set_port_encoding_x (port, pt->encoding);
>> +
>> +  if (pt->input_cd == (iconv_t) -1)
>> +    err = get_utf8_codepoint (port, codepoint, (scm_t_uint8 *) buf, len);
>> +  else
>> +    err = get_iconv_codepoint (port, codepoint, buf, len);
>
> From the code above, it appears that for UTF-8 ports,
> scm_i_set_port_encoding_x will necessarily be called once per
> character read.  This seems rather inefficient.
Correct.  Alas, I don’t know how to avoid this inefficiency in 2.0: we
can’t just add a flag to ‘scm_t_port’, because that would break the ABI.
Ideas?

Besides, however inefficient it may seem, it’s still more efficient
than what we currently have, as I explained.

> Also, if we wish to support Latin-1 without iconv as well, the simple
> method above will not work.

Why would we want such a thing?  :-)

The starting point for this patch was the observation that our Unicode
I/O converts to/from UTF-8, and then from UTF-8 to our internal
representation; it’s wasteful to use iconv to convert from UTF-8 to
UTF-8 when reading from or writing to a UTF-8 port.

> I would recommend adding an enum field to the port which for now only
> has two encoding schemes: ICONV or UTF8.  Later, we could add LATIN1
> and maybe ASCII as well.  Given that this check must be done once per
> character, it seems better to do a switch on an enum than to strcmp
> with pt->encoding (as is done in scm_i_set_port_encoding_x).

Agreed; maybe something for ‘master’ once this version is in 2.0?

Thanks,
Ludo’.
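P.S.  For the record, here is roughly how I picture the enum dispatch
you suggest (all names hypothetical; a sketch, not a patch): the
encoding name is classified once, when the port’s encoding is set, so
the per-character read path only switches on a small enum instead of
strcmp’ing pt->encoding.

```c
#include <string.h>

/* Hypothetical encoding schemes; LATIN1/ASCII could be added later.  */
enum port_encoding_scheme
  {
    PORT_ENCODING_ICONV,             /* generic path: go through iconv */
    PORT_ENCODING_UTF8               /* fast path: decode inline */
  };

/* Classify an encoding name once, at port-encoding-set time.  Anything
   that is not literally "UTF-8" takes the generic iconv path in this
   two-scheme starting point.  */
static enum port_encoding_scheme
classify_encoding (const char *encoding)
{
  if (encoding != NULL && strcmp (encoding, "UTF-8") == 0)
    return PORT_ENCODING_UTF8;
  return PORT_ENCODING_ICONV;
}
```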