Ack.  This text isn't wrapped.  Quotes-at-the-bottom were bad enough.

On Sun, Feb 03, 2002 at 05:57:28AM -0800, Bernard Miller wrote:
> I don't mean to imply that David Starner is an idiot; there are other
> reasons why someone might not understand Bytext, from the trivial (no
> time, no interest) to the less trivial (learning styles,
> documentation errors, etc). I've changed some wording based on his
> specific concerns and I would appreciate other specific comments. The
> issue with the statements he quotes is basically that Bytext strips
> various characters, like combining characters, of all their
> properties except for a name and a property that maps them to a
> Unicode character. So combining characters sort of exist (for
> compatibility) and sort of do not (as fully defined characters).
> Bytext can be thought of as an exercise in massive precomposition, an
> attempt to eliminate the need for combining characters, formatting
> characters, and grapheme clusters. Precomposition is the spirit of
> the W3C character model; Bytext simply takes this to its logical
> conclusion. It simplifies many text processes, especially for
> syllable-oriented scripts like Devanagari. It may seem to involve too
> many characters, but it is finite and thus considerably fewer than
> the infinite number of abstract characters in Unicode. Also, there is
> a logic to the way the characters are formed with bytes that makes
> them easy to process algorithmically; it's not just a huge list of
> characters.
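As an aside: "massive precomposition" already has a concrete meaning
in Unicode, where NFC normalization composes base-plus-combining-mark
sequences into single precomposed codepoints wherever they exist. A
quick illustration of the two forms (a sketch in Python; unicodedata
is the standard module, nothing Bytext-specific about it):

    import unicodedata

    decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
    composed = unicodedata.normalize("NFC", decomposed)

    print(len(decomposed), len(composed))  # 2 1
    print(unicodedata.name(composed))      # LATIN SMALL LETTER E WITH ACUTE

The difference is that Unicode only precomposes the finite set of
combinations inherited from legacy charsets, while Bytext, as
described, tries to precompose everything.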
If this format isn't two-way compatible with Unicode (as well as all
of the major character sets Unicode is two-way compatible with), it's
got another compatibility strike against it.

> About people having an emotional attachment to Unicode, I'm not
> necessarily referring to people on this thread. Perhaps David has
> emotional issues with bad typography; maybe he was abused as a child
> by poor documentation ;-) Nah, but what else other than emotion can
> explain it when minor spelling errors are characterized as
> "inconsistencies" (never mind that Bytext errata has no place on the
> Unicode mailing list); or the various hostile comments only minutes
> after it was announced; or the knee-jerk ridicule of new characters I
> proposed which later received serious consideration by other members;
> or the many people who took offense at the mere implication that they
> should find it interesting? Character encoding as a science is kind
> of like arithmetic: one doesn't expect a lot of major new
> developments --but things like lambda calculus still come along many
> years later. If someone implementing an arithmetic library doesn't
> even find lambda calculus interesting and refuses to even read about
> it, I would say something is missing from that person; perhaps they
> are prime candidates for being replaced by a robot. The same goes for
> those who are implementing Unicode...

I saw no knee-jerk responses on this list, and this is the one I
currently read.

> As for ASCII transparency (a more appropriate word than
> compatibility) and the general notion of how complex Bytext is
> compared to Unicode, there are two important concepts to take note
> of. The first is that making things easier for the user will USUALLY
> involve making things more difficult for the developer. You can't
> expect a user to shed a tear for a developer; the user simply wants
> the best thing possible. Surely no one is suggesting that Bytext is
> IMPOSSIBLE. It is a headache to implement any new feature, but it is
> also an opportunity for growth.

Lack of ASCII transparency means lack of an upgrade path, which means
impracticality.

> I propose that fast and intuitive regular expressions are a feature
> that will not lose importance, because no matter how fast computers
> get, the amount of data that needs to be searched can easily grow
> even faster. The first step of any search, even a database search, is
> a regex. Not only do regexes need to be fast, they also need to be
> intuitive, because nowadays regexes are composed by ordinary people.
> Open composition searching (a feature of Bytext) is incredibly
> intuitive: you can search for components of characters the same way
> you can search for components of words. It all but eliminates the
> need for case folding or what might be called "diacritic folding". It
> also puts native Unix technologies (8-bit regexes) back in the
> forefront. If you want another compelling reason for Bytext, read the
> section on OBS characters.

Impracticality will kill any format, regardless of what it provides.
(On the "diacritic folding" claim, see the aside below.)

> The other thing to take note of is the notion of absolute complexity
> vs. relative complexity. Because of the lack of ASCII transparency,
> Bytext may arguably be more difficult to implement on a trivial level
> than UTF-8 on ASCII-based systems (it may have more relative
> complexity). But consider that many peoples of the world actually
> want to use their native scripts in protocols and functions. To say
> that being able to automatically ignore non-ASCII codes is why UTF-8
> is better is an affront.
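(The aside on "diacritic folding": you don't need a new encoding for
it. Unicode normalization already decomposes characters into
searchable components. A rough sketch --Python, standard library only,
sample strings my own:

    import re
    import unicodedata

    def fold_diacritics(s):
        # Decompose to NFD, then drop combining marks (category Mn).
        nfd = unicodedata.normalize("NFD", s)
        return "".join(c for c in nfd if unicodedata.category(c) != "Mn")

    text = "gehört, hörte, horte"
    print(re.findall(r"hort", text))                   # ['hort']
    print(re.findall(r"hort", fold_diacritics(text)))  # ['hort', 'hort', 'hort']

Whether that's "intuitive" enough for ordinary people is a fair
question, but the components are already there to search.)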
An affront?

It's a purely practical matter. Unixes in general are 7-bit ASCII by
default. Nobody is suggesting ignoring non-ASCII codes once a program
is multibyte-aware; we're saying that you don't have to convert *all
of your programs at once* in order to use it at all, which is what you
have to do for every ASCII-incompatible character set. Doing that
would be completely impossible, and, in the real world, it simply
won't happen.

> Not doing a proper conversion of charsets is clearly a hack, like
> programming without type safety --not always a bad thing, but it
> certainly shouldn't be imposed on everyone.

Incorrect. Since ASCII is a subset of UTF-8, no conversion is
necessary, so any such conversion would simply do nothing. (The same
is true of every character set that is a superset of ASCII --which is
most of them, for the same practical reasons.) This is by design, of
course; ASCII is the common denominator, which makes it possible to
transition to more useful character sets. If a text file is correct
ASCII and the user's locale is UTF-8, the text file is correct UTF-8;
all of the codepoints in it are well-defined and nothing is assumed.
This isn't a hack; this is the design of UTF-8 and Unicode. (See the
P.S. for a short demonstration.)

> Many of the elegant features of Unixes depend on the notion of 8-bit
> transparency: pipe, cat, echo... the byte stream is the common
> denominator. The functions are general-purpose and thus more useful.
> Bytext takes this elegant notion to its logical conclusion: not only
> can you process text as bytes, you can also process bytes as text. By
> default, everything is preserved and there are no special sequences
> to worry about. You can open ANY file as a text file and scan it for
> troubleshooting information, or just as a way of trying to visually
> deduce what kind of file it is. It is useful to apply regular
> expressions and various functions like "diff" to arbitrary binary
> data, not only using the same familiar functions, but also within the
> same familiar application --your text editor.

In practice, it's no big deal to open binary files with a decent text
editor. Vim handles it just fine. (The major issue is handling NULs,
and that won't be helped by any encoding.) If I want to grep a binary
file, I use "strings file | grep" (or "nm" or whatever), and I only
grep the stuff that's useful to grep. Same for diff.

(David:)
> How about an example? Say, "ᎰᎵ hat Musik gut gehört." What does that
> look like bytewise in Bytext?

A distinct advantage of replying in the common style is that it helps
responses; I don't think you answered this question, at least not on
this list.
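P.S. The "no conversion necessary" point, made concrete (a sketch in
Python; the sample string is David's, minus the Cherokee):

    s = "Musik gut gehört"
    utf8 = s.encode("utf-8")

    # Every ASCII character is the same single byte in UTF-8...
    assert "Musik gut geh".encode("ascii") in utf8

    # ...and each non-ASCII character is encoded using only bytes
    # >= 0x80, so a byte-oriented program can never mistake part of
    # one for an ASCII character.
    assert all(b >= 0x80 for b in "ö".encode("utf-8"))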
--
Glenn Maynard

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/