Re: coreutils and i18n

Pádraig Brady Mon, 21 Apr 2008 05:04:18 -0700

Bruno Haible wrote:
> Jim Meyering wrote:
>>>   - Processing in unibyte locales should not become significantly slower
>>>     than before.
>>>   - Code duplication should be avoided, for maintainability.
>>>   - Macros which expand to one thing in the multibyte case and to another
>>>     thing for the unibyte case are not acceptable.
>>>
>>> How will this students' project solve this dilemma?
>> There's no guarantee, but Paul and I will be supervising.
> 
> I mean, what is technically the solution to the dilemma? The typical idiom
> for keeping the speed of the unibyte case is - see e.g. 
> gnulib/lib/mbscasecmp.c
> as an example -
> 
>   #if HAVE_MBRTOWC
>     if (MB_CUR_MAX > 1)
>       ... multibyte case ...
>     else
>   #endif
>       ... unibyte case ...
> 
> but it does have code duplication.


That's the obvious solution, but it's not really what is required or desired.

If I were being paid to do it (I have very little free time, unfortunately),
then I would do something like:

1. Identify filters that require multibyte handling.
2. Refactor line input processing etc. into shared code.
3. Intelligently apply multibyte processing.
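
A minimal sketch of steps 2 and 3, in Python for brevity. It is only an
illustration under stated assumptions: the names `read_lines`, `make_key`
and `uniq` are hypothetical, nothing here is taken from coreutils or the
i18n patch, and the `multibyte` flag stands in for a check like
MB_CUR_MAX > 1. The point is that the choice between byte and character
handling is made once, outside the per-line loop, and that exact matching
never needs decoding at all.

```python
def read_lines(stream):
    """Shared line input (step 2): yield raw byte lines without the newline."""
    for raw in stream:
        yield raw[:-1] if raw.endswith(b"\n") else raw


def make_key(ignore_case, multibyte, encoding="utf-8"):
    """Step 3: choose the comparison key once, before the loop starts.

    `multibyte` is an assumed stand-in for a locale check such as
    MB_CUR_MAX > 1; `encoding` is the assumed locale charset.
    """
    if not ignore_case:
        # Exact matching: equal bytes mean equal lines in any encoding,
        # so the fast byte path works even in a UTF-8 locale.
        return lambda line: line
    if not multibyte:
        # Unibyte locale: ASCII-only case folding directly on bytes.
        return bytes.lower
    # Only this case pays for decoding. surrogateescape keeps invalid
    # byte sequences comparable instead of raising an error.
    return lambda line: line.decode(encoding, "surrogateescape").casefold()


def uniq(lines, key):
    """Drop adjacent duplicate lines, comparing them via `key`."""
    previous = object()  # sentinel that never equals a real key
    for line in lines:
        current = key(line)
        if current != previous:
            yield line
            previous = current
```

With this split, a plain `uniq` in a UTF-8 locale would take the same path
as LC_CTYPE=C, and only an option like case-insensitive matching would
trigger the slower character-level comparison.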

For illustration, look at the current performance of various `uniq`
implementations:

$ rpm -q coreutils
coreutils-6.9-9.fc8

$ echo $LANG
en_IE.UTF-8

# The default one uses the existing i18n patch
$ time uniq < lines.test > /dev/null
real    0m27.724s

$ time LC_CTYPE=C uniq < lines.test > /dev/null
real    0m1.314s

$ time ~/git/coreutils/src/uniq < lines.test > /dev/null
real    0m1.187s

$ time ~/myuniq < lines.test > /dev/null
real    0m0.827s

$ time ~/uniq.py < lines.test > /dev/null
real    0m2.657s

Yes, the Python version (which I wrote in nearly the same
time as the default uniq took to complete the test) is much better!

`myuniq` is a version I implemented from scratch,
to understand some of the issues involved:
http://lists.gnu.org/archive/html/bug-coreutils/2006-07/msg00153.html

It's not just performance. The i18n patch for uniq is also functionally
buggy, for example in the presence of NUL characters:

$ for i in 1 2 3; do echo -e "1234\x0056789"; done | uniq
123456789
123456789
123456789

$ for i in 1 2 3; do echo -e "1234\x0056789"; done | LANG=C uniq
123456789
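
The single-line result from the C locale run is what a length-aware byte
comparison gives, since the three input lines are byte-for-byte identical.
A minimal sketch of that behaviour in Python follows. It is not the
`uniq.py` timed above (its source is not shown here), and `uniq_bytes` is
a hypothetical name.

```python
from itertools import groupby


def uniq_bytes(data: bytes) -> bytes:
    """Collapse adjacent duplicate lines, treating NUL as an ordinary byte."""
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty piece after the final newline
    # groupby() groups adjacent equal items; bytes equality compares the
    # whole line, so nothing is cut short at an embedded NUL.
    return b"".join(line + b"\n" for line, _ in groupby(lines))
```

Lines that differ only after the NUL stay distinct, which a comparison
that stops at the first NUL would get wrong.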

It's great that Paul & Jim are looking at this interesting project.
As I've mentioned before, it really is important.

cheers,
Pádraig.


