On Sat, Nov 03, 2001 at 06:16:32PM +0000, Markus Kuhn wrote:
> On Sat, 3 Nov 2001, Eli Zaretskii wrote:
>
> > > ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO-4.html
> >
> > This is still silent about Grep, Sort, and tr, which are
> > the utilities where the non-ASCII support should be a non-trivial
> > change.
> >
> > Basically, even after reading that page (which told me something I
> > didn't know in some cases), Unicode support in basic development
> > tools is still very much rudimentary.
>
> In practice, Perl has long ago replaced grep, sort, tr, awk, for all but
> sentimental reasons. Most of these little silly things were written as
> inefficient separate C processes before 1975 for the sole reason that the
> PDP-11 that Ritchie and Thompson used had only 64 kB RAM and couldn't
> handle any larger multi-function tools:
>
>   http://www.bell-labs.com/history/unix/
>   http://www.bell-labs.com/history/unix/firstport.html
>
> Today, these tiny tools mostly lead people to write extremely inefficient
> shell scripts that spend 90% of their time in fork().
>
> UTF-8 support for Perl is in an advanced state, and for some more
> experienced UTF-8 users, "grep", "sort", "tr", etc. are merely convenient
> and nostalgic shell functions or scripts that call perl to do the job.

Remember that these tools are most often used interactively today. Perl
is useful for scripting, and is almost always better than an sh script,
but I often don't want to use it interactively.

    grep -i "text.*pattern" file | sort | uniq

How do you do that in Perl? I'm sure it's not hard, but it's almost
certainly more code--and even if it were faster, the (very minor) savings
wouldn't be worth the time spent learning how. Perl's a useful tool, but
not for all tasks and not for all people.

Bear in mind that even if you can make a convincing argument that these
programs are obsolete (and I maintain that you can't--not for all of
them, though I'd agree in the case of awk), most Unix programmers are
still going to want them, so the tools must support UTF-8 before UTF-8 is
fully usable for those programmers. It's hard enough to convince some
people to take the time to switch to UTF-8 without also trying to
convince them never to use grep.

That said, egrep does have full locale support. You need the current beta
snapshot for multibyte encodings to work, and it's currently painfully
slow with them--many times slower than with single-byte encodings.
Hopefully that'll change; the best thing about egrep is its speed.
(Hopefully that's why this support hasn't been released yet.)

tr still doesn't appear to work:

    tr "a" "さ"
    a
    ?
    03:30pm [EMAIL PROTECTED]/6 [~] tr --version
    tr (GNU textutils) 2.0

I'll leave testing sort to someone else; I don't know what kind of test
would show problems. (I think I remember reading that sorting UTF-8
strings byte by byte gives the same order as sorting them by UCS-4 code
point, and I'm not sure how that would affect a test set.)

-- 
Glenn Maynard
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
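
A note on the tr test above: the "?" is consistent with tr translating
byte by byte, although that is an assumption here and has not been checked
against the textutils source. Under that assumption "a" is mapped to the
first byte of the three-byte UTF-8 sequence for "さ", which is not valid
UTF-8 on its own. The Python sketch below only illustrates the difference
between byte-wise and character-wise translation; the helper names
tr_bytes and tr_chars are hypothetical and are not part of any tool
mentioned in this thread.

```python
def tr_bytes(data: bytes, set1: bytes, set2: bytes) -> bytes:
    """Translate byte by byte, as a byte-oriented tr is assumed to do."""
    # Pad set2 with its last byte if it is shorter than set1,
    # then ignore any extra bytes in set2.
    if len(set2) < len(set1):
        set2 = set2 + set2[-1:] * (len(set1) - len(set2))
    table = bytes.maketrans(set1, set2[:len(set1)])
    return data.translate(table)


def tr_chars(text: str, set1: str, set2: str) -> str:
    """Translate character by character, as a multibyte-aware tr would."""
    if len(set2) < len(set1):
        set2 = set2 + set2[-1:] * (len(set1) - len(set2))
    return text.translate(str.maketrans(set1, set2[:len(set1)]))


if __name__ == "__main__":
    sa = "さ"                       # U+3055, bytes E3 81 95 in UTF-8
    raw = tr_bytes("a\n".encode("utf-8"), b"a", sa.encode("utf-8"))
    print(raw)                      # a lone lead byte followed by newline
    print(raw.decode("utf-8", errors="replace"))  # shows a replacement char
    print(tr_chars("a\n", "a", sa)) # the intended result
```

The byte-wise version emits a lone 0xE3 byte, which a UTF-8 terminal
cannot display as a character, while the character-wise version produces
the intended "さ".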
