Hi Pádraig,
Pádraig Brady <[email protected]> writes:
>>> Right. But that got me thinking that we could optimize
>>> in various cases, rather than resorting to mbsstr().
>>> The attached implements mbsmbchr(mbs, mbc) to more efficiently
>>> search for a multi-byte char in a multi-byte string,
>>> especially with the usual UTF-8 charset
>>> (which is determined with a single mbrtoc32() call per process).
>> I wonder if that function is worth putting in gl/ under LGPL in case we
>> want to use it in other programs and/or move it to Gnulib. It seems
>> useful to me.
>
> Yes probably.
> I was going to look at maybe using it in cut(1) too,
> in which case it would definitely be appropriate to move to gl/
I was thinking about some i18n stuff today. A prerequisite to cut(1) is
getndelim2, which is probably the part that requires the most work.
For cut_fields, which uses getndelim2, I am thinking of a new function.
Something like this, which uses is_utf8_charset() from numfmt:
ssize_t
mb_getndelim2 (char **lineptr, size_t *linesize, size_t offset,
               size_t nmax, mcel_t delim1, mcel_t delim2,
               FILE *stream)
{
  /* Fast path: the byte-oriented getndelim2 is safe in a unibyte
     locale, when both delimiters are below 0x30 (Gnulib's mbschr
     assumes no multi-byte character contains such bytes), or in
     UTF-8 when both delimiters are ASCII.  */
  if (MB_CUR_MAX == 1
      || (delim1.ch < 0x30 && delim2.ch < 0x30)
      || (is_utf8_charset ()
          && delim1.ch < 0x80 && delim2.ch < 0x80))
    return getndelim2 (lineptr, linesize, offset, nmax, delim1.ch,
                       delim2.ch, stream);

  mbbuf_t mbbuf;
  char buffer[BUFSIZ];
  ssize_t bytes_read = 0;
  mbbuf_init (&mbbuf, buffer, sizeof buffer, stream);

  /* Read from the file using mbbuf_get_char until reaching a
     delimiter, allocating and copying into LINEPTR as needed.  */

  return bytes_read;
}
That would allow us to avoid many mbrtowc calls when using LC_ALL=C (*)
or when using a UTF-8 locale with ASCII delimiters.
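
To illustrate why the byte-oriented path is safe for the UTF-8 case, here
is a minimal standalone sketch. It is not coreutils code, and the helper
name count_fields_bytewise is made up for this example. It relies only on
the UTF-8 property that every byte of a multi-byte sequence has its high
bit set, so a byte below 0x80 can never be part of another character:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
   Count the DELIM-separated fields in the LEN bytes at S by scanning
   bytes with memchr, without decoding any characters.  This is only
   valid for UTF-8 input when DELIM is ASCII: lead and continuation
   bytes of multi-byte sequences are all >= 0x80, so a match on DELIM
   is always a complete character.  */
static size_t
count_fields_bytewise (const char *s, size_t len, char delim)
{
  assert ((unsigned char) delim < 0x80);

  size_t nfields = 1;
  const char *p = s;
  const char *end = s + len;

  while ((p = memchr (p, delim, (size_t) (end - p))) != NULL)
    {
      nfields++;
      p++;   /* Skip past the delimiter byte.  */
    }
  return nfields;
}
```

Called on the UTF-8 bytes of "é,ü,日本" with ',' as the delimiter, this
finds the two commas and reports 3 fields, with no multi-byte decoding at
all. A non-ASCII delimiter would need the slower decoding path instead.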
WDYT?
Collin
(*) Ignoring systems where the C locale is UTF-8.