I don’t mean to imply that David Starner is an idiot; there are other reasons why someone might not understand Bytext, from the trivial (no time, no interest) to the less trivial (learning styles, documentation errors, etc.). I’ve changed some wording based on his specific concerns, and I would appreciate other specific comments. The issue with the statements he quotes is basically that Bytext strips various characters, such as combining characters, of all their properties except for a name and a property that maps them to a Unicode character. So combining characters sort of exist (for compatibility) and sort of do not (as fully defined characters). Bytext can be thought of as an exercise in massive precomposition, an attempt to eliminate the need for combining characters, formatting characters, and grapheme clusters. Precomposition is the spirit of the W3C character model; Bytext simply takes this to its logical conclusion. It simplifies many text processes, especially for syllable-oriented scripts like Devanagari. It may seem to involve too many characters, but the set is finite and thus considerably smaller than the infinite set of abstract characters in Unicode. Also, there is a logic to the way the characters are formed from bytes that makes them easy to process algorithmically; it’s not just a huge list of characters.
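Bytext itself has no implementation to demonstrate, but a small Python sketch using the standard unicodedata module shows what precomposition means in Unicode terms; NFC normalization stands in here for the “massive precomposition” idea:

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points,
# one visible character.
decomposed = "e\u0301"

# NFC normalization precomposes the pair into the single code point U+00E9.
precomposed = unicodedata.normalize("NFC", decomposed)

assert len(decomposed) == 2
assert precomposed == "\u00e9"   # é as one precomposed code point
assert len(precomposed) == 1
```

As described, Bytext would go further than NFC: precomposed forms would be the only forms, so a normalization step like this would not be needed at all.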
About people having an emotional attachment to Unicode, I’m not necessarily referring to people on this thread. Perhaps David has emotional issues with bad typography; maybe he was abused as a child by poor documentation ;-) Nah, but what else other than emotion can explain it when minor spelling errors are characterized as “inconsistencies” (never mind that Bytext errata have no place on the Unicode mailing list); or the various hostile comments only minutes after it was announced; or the knee-jerk ridicule of new characters I proposed which later received serious consideration by other members; or the many people who took offense at the mere implication that they should find it interesting? Character encoding as a science is kind of like arithmetic: one doesn’t expect a lot of major new developments --but things like the lambda calculus still come along many years later. If someone implementing an arithmetic library doesn’t even find the lambda calculus interesting and refuses even to read about it, I would say something is missing from that person; perhaps they are a prime candidate for being replaced by a robot. The same goes for those who are implementing Unicode...

As for ASCII transparency (a more appropriate word than compatibility) and the general notion of how complex Bytext is compared to Unicode, there are two important concepts to take note of. The first is that making things easier for the user will USUALLY involve making things more difficult for the developer. You can’t expect a user to shed a tear for a developer; the user simply wants the best thing possible. Surely no one is suggesting that Bytext is IMPOSSIBLE to implement. It is a headache to implement any new feature, but it is also an opportunity for growth. I propose that fast and intuitive regular expressions are a feature that will not lose importance, because no matter how fast computers get, the amount of data that needs to be searched can easily grow even faster.
The first step of any search, even a database search, is a regex. Not only do regexes need to be fast, they also need to be intuitive, because nowadays regexes are composed by ordinary people. Open composition searching (a feature of Bytext) is incredibly intuitive: you can search for components of characters the same way you can search for components of words. It all but eliminates the need for case folding or what might be called “diacritic folding”. It also puts native Unix technologies (8-bit regexes) back in the forefront. If you want another compelling reason for Bytext, read the section on OBS characters.

The other thing to take note of is the notion of absolute complexity vs. relative complexity. Because of the lack of ASCII transparency, Bytext may arguably be more difficult to implement on a trivial level than UTF-8 on ASCII-based systems (it may have more relative complexity). But consider that many peoples of the world actually want to use their native scripts in protocols and functions. To say that being able to automatically ignore non-ASCII codes is why UTF-8 is better is an affront. Not doing a proper conversion of charsets is clearly a hack, like programming without type safety --not always a bad thing, but certainly not something that should be imposed on everyone. In absolute terms of complexity, Bytext is much simpler than Unicode. All Bytext character properties are modelled by a single table. The bidi algorithm and the line breaking algorithm are vastly simplified. The titlecase property is eliminated while retaining its functionality. East Asian Width properties go from being described in an entire technical report with 6 properties to being equivalently described by a single paragraph and a single property. Consider the complications with the new grapheme joining control character, and the various normative spelling conventions for syllables. Consider the many Unicode technical reports, the 850-page book, the many files of the Unicode database...
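Open composition searching can’t be demonstrated directly without a Bytext implementation, but a rough Python analogy using Unicode NFD decomposition shows the flavor: once characters are decomposed into components, a plain-letter pattern matches accented variants too, which is the “diacritic folding” effect described above.

```python
import re
import unicodedata

text = "café résumé naive"

# Decompose so accents become separate combining code points (U+0300..U+036F).
nfd = unicodedata.normalize("NFD", text)

# Stripping the combining marks folds the diacritics away entirely,
# so a search for the bare letters matches the accented words.
folded = re.sub(r"[\u0300-\u036f]", "", nfd)

assert re.search(r"cafe", folded) is not None
assert re.search(r"resume", folded) is not None
```

The difference claimed for Bytext is that this component structure would live directly in the byte sequence, so a plain 8-bit regex engine would get this behavior without any normalization or mark-stripping pass.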
Consider the tremendous complexity of Unicode regex libraries --still not able to do simple things like automatically search for all variants of a syllable. Truly, it is hard to imagine how Unicode could be made any more complex.

Many of the elegant features of Unixes depend on the notion of 8-bit transparency: pipe, cat, echo... the byte stream is the common denominator. The functions are general purpose and thus more useful. Bytext takes this elegant notion to its logical conclusion: not only can you process text as bytes, you can also process bytes as text. By default, everything is preserved and there are no special sequences to worry about. You can open ANY file as a text file and scan it for troubleshooting information, or just as a way of trying to visually deduce what kind of file it is. It is useful to apply regular expressions and functions like “diff” to arbitrary binary data, not only using the same familiar functions but also within the same familiar application --your text editor.

Bernard

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of David Starner
Sent: Saturday, February 02, 2002 9:59 AM
To: [EMAIL PROTECTED]
Cc: Bernard Miller
Subject: Re: Announcing Bytext

On Sat, Feb 02, 2002 at 02:16:37AM -0800, Bernard Miller wrote:
> Hopefully flags will go off when members of this list read things that are
> equivalent to "I don't understand it but here is my opinion on it...".

I'm a very intelligent person. If I can't understand your document after reading through it a couple of times, then the fault is yours.

> Bytext is a superset of Unicode normalization form C, so it certainly
> encodes all of ASCII including form feed, and all combining characters.

Page two, first paragraph: "There are no surrogate pairs; no combining characters". Page one, last paragraph: "In particular, the role of Bytext is clearly separated from the role of markup.
This is contrasted with many features of Unicode such as interlinear annotation characters (U+FFF9..U+FFFB); the object replacement character (U+FFFC); the nesting bidirectional control characters; and the "Tag Characters" (U+E0000..U+E007F)." Page 37, fourth paragraph: "Unlike Unicode, Bytext does not recognize any "page break" type of control character even as an informative property because page formatting is definitely in the domain of markup. Since FF looks like a PEC in screen display it can produce unanticipated results that are very frustrating if it causes unnecessary pages to print, wasting paper."

(It doesn't say that it doesn't have a character named FORM FEED, but it does say that it doesn't have a page break character - i.e. a form feed.)

> ASCII code points are rearranged partly so that characters like form feed
> can be quickly identified by normalization algorithms. This is far from
> "losing ASCII compatibility".

If you're going to rearrange them, they might as well go away - many were confusing, with better alternatives around anyway. But moving them breaks every program that depended on the binary value of ASCII, which we have more or less guaranteed will stay stable under Unix. Unicode is almost always encoded as UTF-8 under Unix for one reason - because 0x00-0x7f is 0x00-0x7f in ASCII. No surprises.

> Also, there is no need for a new
> primitive data type to support Bytext and certainly one is not "insisted"
> upon, merely recommended.

Sorry, recommended. Page 6, first paragraph: "It is recommended that "uByte", as in "unsigned byte", be the name of the data type that is used to store each byte that is to be interpreted as Bytext. It is also recommended that each uByte be a primitive data type in programming languages that have primitive types."

> It is perfectly reasonable to suppose that Bytext will never catch on, but
> the mere fact that it is in an embryonic stage of development is not PROOF
> that it will never catch on.
The fact that your standard is incoherent and you have no corporate or governmental support is strong evidence that it will never catch on. The fact that it's a new startup in a field of preexisting strong contenders puts the nail in the casket. If you want to do a grassroots project, it has to be clear and attractive, and fill a needed gap. Yet another universal charset standard is not a needed gap.

> I love Linux

As can be noted by the Microsoft Word document and lack of HTML or plain text.

> and do not wish to disturb its developers any further so
> perhaps this thread can continue off list for those interested.

I see no reason to move it, personally; it's not too far off topic for the list. Defend your system in public. If it is to become successful, it will have to be defended in public, with understandings of and answers for the standard arguments.

How about an example? Say, "ᎰᎵ hat Musik gut gehört." What does that look like bytewise in Bytext?

--
David Starner - [EMAIL PROTECTED], dvdeug/jabber.com (Jabber)
Pointless website: http://dvdeug.dhis.org
What we've got is a blue-light special on truth. It's the hottest thing with the youth. -- Information Society, "Peace and Love, Inc."

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
