Re: UTF-8 support for the ancient shell toys

Glenn Maynard Sat, 03 Nov 2001 12:49:02 -0800

On Sat, Nov 03, 2001 at 06:16:32PM +0000, Markus Kuhn wrote:
> On Sat, 3 Nov 2001, Eli Zaretskii wrote:
> > > ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO-4.html
> >
> > This is still silent about Grep, Sort, and tr, which are
> > the utilities where the non-ASCII support should be a non-trivial
> > change.
> >
> > Basically, even after reading that page (which told me something I
> > didn't know in some cases), Unicode support in basic development
> > tools is still very much rudimentary.
> 
> In practice, Perl has long ago replaced grep, sort, tr, awk, for all but
> sentimental reasons. Most of these little silly things were written as
> inefficient separate C processes before 1975 for the sole reason that the
> PDP-11 that Ritchie and Thompson used had only 64 kB RAM and couldn't
> handle any larger multi-function tools:
> 
> http://www.bell-labs.com/history/unix/
> http://www.bell-labs.com/history/unix/firstport.html
> 
> Today, these tiny tools mostly lead people to write extremely inefficient
> shell scripts that spend 90% of their time in fork().
> 
> UTF-8 support for Perl is in an advanced state, and for some more
> experienced UTF-8 users, "grep", "sort", "tr", etc. are merely convenient
> and nostalgic shell functions or scripts that call perl to do the job.


Remember that these tools are most often used interactively today; Perl is
useful for scripting, and is almost always better than an sh script, but I
often don't want to use it interactively.

grep -i "text.*pattern" file | sort | uniq

How do you do that in Perl?  I'm sure it's not hard, but it's almost
certainly more code--and even if it were faster, the (very minor) savings
aren't worth the effort of learning how.  Perl's a useful tool, but not
for all tasks and not for all people.

Bear in mind that even if you can make a convincing argument that these
programs are obsolete (and I maintain that you can't--not for all of
them, though I'd agree in the case of awk), most Unix programmers are
still going to want them--so the tools must support UTF-8 before UTF-8
is fully usable for those programmers.  It's hard enough to convince some
people to take the time to switch to UTF-8, without trying to convince
them never to use grep at the same time.

That said, egrep does have full locale support.  You need the current
beta snapshot for multibyte encodings to work, and it's currently
painfully slow compared to singlebyte encodings--many times slower, in
fact.  Hopefully that'll change; the best thing about egrep is its
speed.  (Hopefully that's why this support hasn't been released yet.)

tr still doesn't appear to work:

tr "a" "さ"
a
?

03:30pm [EMAIL PROTECTED]/6 [~] tr --version
tr (GNU textutils) 2.0
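
A guess at what is going on above, as a minimal Python sketch.  It assumes
this tr maps single bytes, which I haven't checked against the textutils
source; bytewise_tr is a made-up name for illustration, not anything in tr
itself.  "さ" is three bytes in UTF-8, so a bytewise mapping of "a" can only
ever emit the first of them:

```python
# Hypothetical model of a byte-oriented tr (an assumption, not the real source).
def bytewise_tr(data: bytes, set1: bytes, set2: bytes) -> bytes:
    # Pair the two sets byte by byte; surplus bytes in the longer set are ignored.
    n = min(len(set1), len(set2))
    table = bytes.maketrans(set1[:n], set2[:n])
    return data.translate(table)

src = "a".encode("utf-8")    # one byte: 61
dst = "さ".encode("utf-8")   # three bytes: e3 81 95

out = bytewise_tr(b"a\n", src, dst)
print(out)                   # b'\xe3\n' -- a lone lead byte, not valid UTF-8

# A character-oriented mapping, for contrast, yields the whole character.
charwise = "a\n".translate(str.maketrans("a", "さ"))
print(charwise)
```

A lone 0xE3 byte is an incomplete UTF-8 sequence, which would be consistent
with a terminal showing "?" for it.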

I'll leave testing sort to someone else; I don't know what kind of test
would show problems.  (I think I remember reading that sorting UTF-8
strings bytewise gives the same order as sorting them by UCS-4 code
point, which is not the same thing as locale collation order, and I'm
not sure how that would affect a test set.)
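
The ordering property recalled above can be checked with a short Python
sketch.  It only compares bytewise order against code point order; it says
nothing about what sort itself does, and the sample size and seed are
arbitrary choices of mine:

```python
import random

# Sample code points across the 1-, 2-, 3- and 4-byte UTF-8 ranges,
# skipping surrogates, which cannot be encoded as UTF-8.
random.seed(0)
points = [cp for cp in random.sample(range(0x110000), 2000)
          if not 0xD800 <= cp <= 0xDFFF]
strings = [chr(cp) for cp in points]

# Python compares str values by code point, so this is UCS-4 order.
by_code_point = sorted(strings)
# Sorting on the encoded form compares raw UTF-8 bytes.
by_utf8_bytes = sorted(strings, key=lambda s: s.encode("utf-8"))

same_order = by_code_point == by_utf8_bytes
print(same_order)
```

If that holds, a bytewise sort of UTF-8 input can't be told apart from a
code point sort, so a useful test set probably needs strings where locale
collation differs from code point order; for instance, code point order
puts "Z" before "a" and "é" after "z", which a locale-aware sort would
typically not do.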

-- 
Glenn Maynard
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
