Ack.  This text isn't wrapped.  Quotes-at-the-bottom were bad enough.

On Sun, Feb 03, 2002 at 05:57:28AM -0800, Bernard Miller wrote:
> I don't mean to imply that David Starner is an idiot; there are other
> reasons why someone might not understand Bytext, from the trivial (no
> time, no interest) to the less trivial (learning styles,
> documentation errors, etc). I've changed some wording based on his
> specific concerns and I would appreciate other specific comments. The
> issue with the statements he quotes is basically that Bytext strips
> various characters, like combining characters, of all their
> properties except for a name and a property that maps them to a
> Unicode character. So combining characters sort of exist (for
> compatibility) and sort of do not (as fully defined characters).
> Bytext can be thought of as an exercise in massive precomposition, an
> attempt to eliminate the need for combining characters, formatting
> characters, and grapheme clusters. Precomposition is the spirit of
> the W3C character model; Bytext simply takes this to its logical
> conclusion. It simplifies many text processes, especially for
> syllable-oriented scripts like Devanagari. It may seem to involve too
> many characters, but it is finite and thus considerably fewer than
> the infinite number of abstract characters in Unicode. Also, there is
> a logic to the way the characters are formed with bytes that makes
> them easy to process algorithmically; it's not just a huge list of
> characters.
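As an aside: "massive precomposition" already has a concrete meaning
in Unicode, where NFC normalization composes base-plus-combining-mark
sequences into single precomposed codepoints wherever they exist. A
quick illustration of the two forms (a sketch in Python; unicodedata
is the standard module, nothing Bytext-specific about it):

    import unicodedata

    decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
    composed = unicodedata.normalize("NFC", decomposed)

    print(len(decomposed), len(composed))  # 2 1
    print(unicodedata.name(composed))      # LATIN SMALL LETTER E WITH ACUTE

The difference is that Unicode only precomposes the finite set of
combinations inherited from legacy charsets, while Bytext, as
described, tries to precompose everything.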
If this format isn't two-way compatible with Unicode (as well as all
of the major character sets Unicode is two-way compatible with), it's
got another compatibility strike against it.

> About people having an emotional attachment to Unicode, I'm not
> necessarily referring to people on this thread. Perhaps David has
> emotional issues with bad typography; maybe he was abused as a child
> by poor documentation ;-) Nah, but what else other than emotion can
> explain it when minor spelling errors are characterized as
> "inconsistencies" (never mind that Bytext errata has no place on the
> Unicode mailing list); or the various hostile comments only minutes
> after it was announced; or the knee-jerk ridicule of new characters I
> proposed which later received serious consideration by other members;
> or the many people who took offense at the mere implication that they
> should find it interesting? Character encoding as a science is kind
> of like arithmetic: one doesn't expect a lot of major new
> developments --but things like lambda calculus still come along many
> years later. If someone implementing an arithmetic library doesn't
> even find lambda calculus interesting and refuses to even read about
> it, I would say something is missing from that person; perhaps they
> are prime candidates for being replaced by a robot. The same goes for
> those who are implementing Unicode...

I saw no knee-jerk responses on this list, and this is the one I
currently read.

> As for ASCII transparency (a more appropriate word than
> compatibility) and the general notion of how complex Bytext is
> compared to Unicode, there are two important concepts to take note
> of. The first is that making things easier for the user will USUALLY
> involve making things more difficult for the developer. You can't
> expect a user to shed a tear for a developer; the user simply wants
> the best thing possible. Surely no one is suggesting that Bytext is
> IMPOSSIBLE. It is a headache to implement any new feature, but it is
> also an opportunity for growth.

Lack of ASCII transparency means lack of an upgrade path, which means
impracticality.

> I propose that fast and intuitive regular expressions are a feature
> that will not lose importance, because no matter how fast computers
> get, the amount of data that needs to be searched can easily grow
> even faster. The first step of any search, even a database search, is
> a regex. Not only do regexes need to be fast, they also need to be
> intuitive, because nowadays regexes are composed by ordinary people.
> Open composition searching (a feature of Bytext) is incredibly
> intuitive: you can search for components of characters the same way
> you can search for components of words. It all but eliminates the
> need for case folding or what might be called "diacritic folding". It
> also puts native Unix technologies (8-bit regexes) back in the
> forefront. If you want another compelling reason for Bytext, read the
> section on OBS characters.

Impracticality will kill any format, regardless of what it provides.
(On the "diacritic folding" claim, see the aside below.)

> The other thing to take note of is the notion of absolute complexity
> vs. relative complexity. Because of the lack of ASCII transparency,
> Bytext may arguably be more difficult to implement on a trivial level
> than UTF-8 on ASCII-based systems (it may have more relative
> complexity). But consider that many peoples of the world actually
> want to use their native scripts in protocols and functions. To say
> that being able to automatically ignore non-ASCII codes is why UTF-8
> is better is an affront.
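(The aside on "diacritic folding": you don't need a new encoding for
it. Unicode normalization already decomposes characters into
searchable components. A rough sketch --Python, standard library only,
sample strings my own:

    import re
    import unicodedata

    def fold_diacritics(s):
        # Decompose to NFD, then drop combining marks (category Mn).
        nfd = unicodedata.normalize("NFD", s)
        return "".join(c for c in nfd if unicodedata.category(c) != "Mn")

    text = "gehört, hörte, horte"
    print(re.findall(r"hort", text))                   # ['hort']
    print(re.findall(r"hort", fold_diacritics(text)))  # ['hort', 'hort', 'hort']

Whether that's "intuitive" enough for ordinary people is a fair
question, but the components are already there to search.)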
An affront?

It's a purely practical matter. Unixes in general are 7-bit ASCII by
default. Nobody is suggesting ignoring non-ASCII codes once a program
is multibyte-aware; we're saying that you don't have to convert *all
of your programs at once* in order to use it at all, which is what you
have to do for every ASCII-incompatible character set. Doing that
would be completely impossible, and, in the real world, it simply
won't happen.

> Not doing a proper conversion of charsets is clearly a hack, like
> programming without type safety --not always a bad thing, but it
> certainly shouldn't be imposed on everyone.

Incorrect. Since ASCII is a subset of UTF-8, no conversion is
necessary, so any such conversion would simply do nothing. (The same
is true of every character set that is a superset of ASCII --which is
most of them, for the same practical reasons.) This is by design, of
course; ASCII is the common denominator, which makes it possible to
transition to more useful character sets. If a text file is correct
ASCII and the user's locale is UTF-8, the text file is correct UTF-8;
all of the codepoints in it are well-defined and nothing is assumed.
This isn't a hack; this is the design of UTF-8 and Unicode. (See the
P.S. for a short demonstration.)

> Many of the elegant features of Unixes depend on the notion of 8-bit
> transparency: pipe, cat, echo... the byte stream is the common
> denominator. The functions are general-purpose and thus more useful.
> Bytext takes this elegant notion to its logical conclusion: not only
> can you process text as bytes, you can also process bytes as text. By
> default, everything is preserved and there are no special sequences
> to worry about. You can open ANY file as a text file and scan it for
> troubleshooting information, or just as a way of trying to visually
> deduce what kind of file it is. It is useful to apply regular
> expressions and various functions like "diff" to arbitrary binary
> data, not only using the same familiar functions, but also within the
> same familiar application --your text editor.

In practice, it's no big deal to open binary files with a decent text
editor. Vim handles it just fine. (The major issue is handling NULs,
and that won't be helped by any encoding.) If I want to grep a binary
file, I use "strings file | grep" (or "nm" or whatever), and I only
grep the stuff that's useful to grep. Same for diff.

(David:)
> How about an example? Say, "ᎰᎵ hat Musik gut gehört." What does that
> look like bytewise in Bytext?

A distinct advantage of replying in the common style is that it helps
responses; I don't think you answered this question, at least not on
this list.
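P.S. The "no conversion necessary" point, made concrete (a sketch in
Python; the sample string is David's, minus the Cherokee):

    s = "Musik gut gehört"
    utf8 = s.encode("utf-8")

    # Every ASCII character is the same single byte in UTF-8...
    assert "Musik gut geh".encode("ascii") in utf8

    # ...and each non-ASCII character is encoded using only bytes
    # >= 0x80, so a byte-oriented program can never mistake part of
    # one for an ASCII character.
    assert all(b >= 0x80 for b in "ö".encode("utf-8"))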
--
Glenn Maynard

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/