I read http://www.cl.cam.ac.uk/~mgk25/unicode.html and did not find that it
resolves the difficulties of running fast regular expressions over
multilingual text in UTF-8. Bytext (www.bytext.org) is a lot like the
metric system or other clever standards like A4 paper: it has solid
technical advantages that cannot be dismissed simply by saying we have
something that works already. There are always going to be those who insist
that things get better over time.
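
As a rough illustration of the kind of difficulty meant here, the sketch
below runs the same pattern once over characters and once over raw UTF-8
bytes. It is written in Python purely for illustration; the sample string
and variable names are assumptions, and nothing in it is taken from the
Bytext or UTF-8 documents.

```python
import re

# Sample text with two non-ASCII characters (i-diaeresis and e-acute),
# written as escapes so the code points are unambiguous.
text = "na\u00efve caf\u00e9"
data = text.encode("utf-8")

# Character-level matching: "." consumes one code point.
char_matches = re.findall(r".", text)

# Byte-level (8-bit) matching: "." consumes one byte, so each
# two-byte character is split into two separate matches.
byte_matches = re.findall(rb".", data)

print(len(char_matches), len(byte_matches))

# A byte-level character class built from a multi-byte character
# matches its individual bytes, not the character as a unit.
byte_class = re.compile(b"[" + "\u00e9".encode("utf-8") + b"]")
print(byte_class.findall("\u00e9".encode("utf-8")))
```

An 8-bit engine therefore has to be taught about multi-byte sequences, or
the patterns have to be rewritten, before "." and character classes behave
per character.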

Hopefully flags will go off when members of this list read things that are
equivalent to "I don't understand it but here is my opinion on it...".
Bytext is a superset of Unicode normalization form C, so it certainly
encodes all of ASCII including form feed, and all combining characters.
ASCII code points are rearranged partly so that characters like form feed
can be quickly identified by normalization algorithms. This is far from
"losing ASCII compatibility". It just means that conversion must be done
properly rather than by ignoring certain ranges. Also, there is no need for
a new primitive data type to support Bytext, and one is certainly not
"insisted" upon, merely recommended.

It is perfectly reasonable to suppose that Bytext will never catch on, but
the mere fact that it is in an embryonic stage of development is not PROOF
that it will never catch on. Many people who appear to have an emotional
attachment to Unicode offer this as their only evidence that Bytext is not
worthwhile... as if how interesting something is should be directly related
to how well developed and popular it is. Again, I hope flags go off.

I love Linux and do not wish to disturb its developers any further, so
perhaps this thread can continue off list for those interested. Those who
want to compare Rosetta can find it here:
http://www.kotovnik.com/~avg/rosetta/.

Bernard

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]]On Behalf Of Markus Kuhn
Sent: Friday, February 01, 2002 11:39 AM
To: Bernard Miller
Cc: [EMAIL PROTECTED]
Subject: Re: Announcing Bytext


"Bernard Miller" wrote on 2002-02-01 16:22 UTC:
> Hello,
> For those of you not already on the Unicode mailing list I thought you
> would like to be aware of www.bytext.org. Bytext has a much better design
> than Unicode and is a better long term solution. One of the main features
> is that it is designed to be searchable with fast 8 bit regular expression
> algorithms. You may want to build in some flexibility to deal with Bytext
> in your implementation of UTF-8, perhaps even give up on UTF-8 altogether
> if it's possible for you to focus on the long term.

UCS has quite a number of historic oddities, no doubt, but at least we
understand them rather well now and they are reasonably easy to work
around. The way we have started to use UTF-8 on POSIX, GNU, Perl, etc.
systems already fixes many of the same problems that Bytext tries to
fix. Therefore, I don't see Bytext offering a practical advantage
significant enough to make it a really serious alternative to UTF-8.

I suspect that in our lifetime there was only one realistic moment in
history to get the entire industry to agree on a single coded
character set architecture, and I fear Bytext comes almost exactly 10
years too late for that.

http://www.cl.cam.ac.uk/~mgk25/unicode.html

Markus

--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
