Bruno Haible wrote:
> Jim Meyering wrote:
>>> - Processing in unibyte locales should not become significantly slower
>>>   than before.
>>> - Code duplication should be avoided, for maintainability.
>>> - Macros which expand to one thing in the multibyte case and to another
>>>   thing for the unibyte case are not acceptable.
>>>
>>> How will this students' project solve this dilemma?
>> There's no guarantee, but Paul and I will be supervising.
>
> I mean, what is technically the solution to the dilemma? The typical idiom
> for keeping the speed of the unibyte case is - see e.g.
> gnulib/lib/mbscasecmp.c
> as an example -
>
>   #if HAVE_MBRTOWC
>     if (MB_CUR_MAX > 1)
>       ... multibyte case ...
>     else
>   #endif
>       ... unibyte case ...
>
> but it does have code duplication.

That's the obvious solution, but it's not really required or desired.
If I were being paid to do it (I have very little free time unfortunately),
then I would do something like:

1. Identify filters that require multibyte handling.
2. Refactor line input processing etc. into shared code.
3. Intelligently apply multibyte processing.

For illustration, look at the current performance of various `uniq`
implementations:

$ rpm -q coreutils
coreutils-6.9-9.fc8
$ echo $LANG
en_IE.UTF-8

# The default one uses the existing i18n patch
$ time uniq < lines.test > /dev/null
real 0m27.724s

$ time LC_CTYPE=C uniq < lines.test > /dev/null
real 0m1.314s

$ time ~/git/coreutils/src/uniq < lines.test > /dev/null
real 0m1.187s

$ time ~/myuniq < lines.test > /dev/null
real 0m0.827s

$ time ~/uniq.py < lines.test > /dev/null
real 0m2.657s

Yes, the Python version (which I nearly wrote in the same time as
the default uniq took to complete the test) is much better!

`myuniq` is a version I implemented from scratch, to understand
some of the issues involved:
http://lists.gnu.org/archive/html/bug-coreutils/2006-07/msg00153.html

It's not just performance. The functionality of the i18n patch
for uniq is buggy in the presence of NUL characters, for example:

$ for i in 1 2 3; do echo -e "1234\x0056789"; done | uniq
123456789
123456789
123456789

$ for i in 1 2 3; do echo -e "1234\x0056789"; done | LANG=C uniq
123456789

It's great that Paul & Jim are looking at this interesting project,
as it really is important, as I've mentioned before.

cheers,
Pádraig.
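
As a rough sketch of how steps 1 to 3 above might look for a `uniq`-like filter (this is not the coreutils or `myuniq` code): the line type and comparator are shared, lengths are carried explicitly so embedded NUL bytes are compared rather than truncating the line, and the multibyte path is chosen once and only when case folding needs it. Every name here (`struct line`, `line_eq_fn`, `eq_bytes`, `eq_fold_bytes`, `eq_fold_mb`, `next_char`, `choose_eq`) is hypothetical. The sketch assumes `mbrtowc` is available, and that byte equality is the wanted notion of "same line" when case is not ignored; locale collation is not considered.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>   /* MB_CUR_MAX */
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical line type: the explicit length allows embedded NUL bytes. */
struct line { const char *buf; size_t len; };

/* Hypothetical comparator signature shared by every code path. */
typedef bool (*line_eq_fn)(const struct line *, const struct line *);

/* Exact match: assumed sufficient in any locale when case is not folded.
   memcmp, unlike strcmp, does not stop at a NUL byte. */
bool eq_bytes(const struct line *a, const struct line *b)
{
    return a->len == b->len
           && (a->len == 0 || memcmp(a->buf, b->buf, a->len) == 0);
}

/* Case-insensitive match for unibyte locales. */
bool eq_fold_bytes(const struct line *a, const struct line *b)
{
    if (a->len != b->len)
        return false;
    for (size_t i = 0; i < a->len; i++)
        if (tolower((unsigned char)a->buf[i]) != tolower((unsigned char)b->buf[i]))
            return false;
    return true;
}

/* Decode one character from p (n > 0 bytes left); return bytes consumed.
   An undecodable byte is passed through as itself with *valid == false. */
size_t next_char(const char *p, size_t n, mbstate_t *st, wchar_t *wc, bool *valid)
{
    assert(n > 0);
    size_t r = mbrtowc(wc, p, n, st);
    *valid = (r != (size_t)-1 && r != (size_t)-2);
    if (!*valid) {
        memset(st, 0, sizeof *st);   /* reset the state after an error */
        *wc = (unsigned char)*p;
        return 1;
    }
    return r == 0 ? 1 : r;           /* r == 0: a NUL character, one byte */
}

/* Case-insensitive match for multibyte locales. */
bool eq_fold_mb(const struct line *a, const struct line *b)
{
    mbstate_t sa, sb;
    size_t i = 0, j = 0;
    memset(&sa, 0, sizeof sa);
    memset(&sb, 0, sizeof sb);
    while (i < a->len && j < b->len) {
        wchar_t wa, wb;
        bool va, vb;
        i += next_char(a->buf + i, a->len - i, &sa, &wa, &va);
        j += next_char(b->buf + j, b->len - j, &sb, &wb, &vb);
        if (va != vb)
            return false;
        if (va ? towlower(wa) != towlower(wb) : wa != wb)
            return false;
    }
    return i == a->len && j == b->len;
}

/* Pick the comparator once, at startup, instead of branching per line. */
line_eq_fn choose_eq(bool ignore_case)
{
    if (!ignore_case)
        return eq_bytes;
    return MB_CUR_MAX > 1 ? eq_fold_mb : eq_fold_bytes;
}
```

A caller would presumably run `setlocale (LC_ALL, "")` once, call `choose_eq` once, and read input with something like POSIX `getline`, which returns the line length and so preserves NUL bytes. The per-line cost in the common exact-match case is then a single `memcmp`, whatever the locale.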