I read http://www.cl.cam.ac.uk/~mgk25/unicode.html and did not find that it resolves the difficulties of running fast regular expressions over multilingual text in UTF-8. Bytext (www.bytext.org) is a lot like the metric system or other clever standards such as A4 paper: it has solid technical advantages that cannot be dismissed simply by saying we already have something that works. There are always going to be those who insist that things get better over time.

Hopefully flags will go off when members of this list read things that are equivalent to "I don't understand it, but here is my opinion on it...". Bytext is a superset of Unicode normalization form C, so it certainly encodes all of ASCII, including form feed, and all combining characters. ASCII code points are rearranged partly so that characters like form feed can be quickly identified by normalization algorithms. This is far from "losing ASCII compatibility". It simply means that conversion must be done properly, rather than by simply ignoring certain ranges. Also, there is no need for a new primitive data type to support Bytext, and certainly one is not "insisted" upon, merely recommended.

It is perfectly reasonable to suppose that Bytext will never catch on, but the mere fact that it is in an embryonic stage of development is not PROOF that it will never catch on. Many people who seem to have an emotional attachment to Unicode seem to be providing this as the only evidence that Bytext is not worthwhile... as if how interesting something is should be directly related to how well developed and popular it is. Again, I hope flags go off.

I love Linux and do not wish to disturb its developers any further, so perhaps this thread can continue off list for those interested. Those who want to compare Rosetta can find it at http://www.kotovnik.com/~avg/rosetta/.

Bernard

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Markus Kuhn
Sent: Friday, February 01, 2002 11:39 AM
To: Bernard Miller
Cc: [EMAIL PROTECTED]
Subject: Re: Announcing Bytext

"Bernard Miller" wrote on 2002-02-01 16:22 UTC:
> Hello,
> For those of you not already on the Unicode mailing list I thought you would
> like to be aware of www.bytext.org. Bytext has a much better design than
> Unicode and is a better long term solution. One of the main features is that
> it is designed to be searchable with fast 8 bit regular expression
> algorithms. You may want to build in some flexibility to deal with Bytext in
> your implementation of UTF-8, perhaps even give up on UTF-8 altogether if it's
> possible for you to focus on the long term.

UCS has quite a number of historic oddities, no doubt, but at least we understand them rather well now and they are reasonably easy to work around. The way we have started to use UTF-8 on POSIX, GNU, Perl, etc. systems fixes already many of the same problems that Bytext tries to fix. Therefore, I don't see Bytext offering any so enormously significant practical advantage to consider it as a really serious alternative to UTF-8.

I suspect that for our lifetime there was only one realistic moment in history to get the entire industry to agree onto a single coded character set architecture, and I fear Bytext comes pretty exactly 10 years too late here.

http://www.cl.cam.ac.uk/~mgk25/unicode.html

Markus

--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
