I don’t mean to imply that David Starner is an idiot; there are other reasons why someone might not understand Bytext, from the trivial (no time, no interest) to the less trivial (learning styles, documentation errors, etc.). I’ve changed some wording based on his specific concerns, and I would appreciate other specific comments. The issue with the statements he quotes is basically that Bytext strips various characters, such as combining characters, of all their properties except for a name and a property that maps them to a Unicode character. So combining characters sort of exist (for compatibility) and sort of do not (as fully defined characters). Bytext can be thought of as an exercise in massive precomposition, an attempt to eliminate the need for combining characters, formatting characters, and grapheme clusters. Precomposition is the spirit of the W3C character model; Bytext simply takes this to its logical conclusion. It simplifies many text processes, especially for syllable-oriented scripts like Devanagari. It may seem to involve too many characters, but the set is finite and thus considerably smaller than the infinite set of abstract characters in Unicode. Also, there is a logic to the way the characters are formed from bytes that makes them easy to process algorithmically; it’s not just a huge list of characters.
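Bytext itself has no implementation to demonstrate, but a small Python sketch using the standard unicodedata module shows what precomposition means in Unicode terms; NFC normalization stands in here for the “massive precomposition” idea:

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points,
# one visible character.
decomposed = "e\u0301"

# NFC normalization precomposes the pair into the single code point U+00E9.
precomposed = unicodedata.normalize("NFC", decomposed)

assert len(decomposed) == 2
assert precomposed == "\u00e9"   # é as one precomposed code point
assert len(precomposed) == 1
```

As described, Bytext would go further than NFC: precomposed forms would be the only forms, so a normalization step like this would not be needed at all.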
About people having an emotional attachment to Unicode, I’m not necessarily referring to people on this thread. Perhaps David has emotional issues with bad typography; maybe he was abused as a child by poor documentation ;-) Nah, but what else other than emotion can explain it when minor spelling errors are characterized as “inconsistencies” (never mind that Bytext errata have no place on the Unicode mailing list); or the various hostile comments only minutes after it was announced; or the knee-jerk ridicule of new characters I proposed which later received serious consideration by other members; or the many people who took offense at the mere implication that they should find it interesting? Character encoding as a science is kind of like arithmetic: one doesn’t expect a lot of major new developments --but things like the lambda calculus still come along many years later. If someone implementing an arithmetic library doesn’t even find the lambda calculus interesting and refuses even to read about it, I would say something is missing from that person; perhaps they are a prime candidate for being replaced by a robot. The same goes for those who are implementing Unicode...

As for ASCII transparency (a more appropriate word than compatibility) and the general notion of how complex Bytext is compared to Unicode, there are two important concepts to take note of. The first is that making things easier for the user will USUALLY involve making things more difficult for the developer. You can’t expect a user to shed a tear for a developer; the user simply wants the best thing possible. Surely no one is suggesting that Bytext is IMPOSSIBLE to implement. It is a headache to implement any new feature, but it is also an opportunity for growth. I propose that fast and intuitive regular expressions are a feature that will not lose importance, because no matter how fast computers get, the amount of data that needs to be searched can easily grow even faster.
The first step of any search, even a database search, is a regex. Not only do regexes need to be fast, they also need to be intuitive, because nowadays regexes are composed by ordinary people. Open composition searching (a feature of Bytext) is incredibly intuitive: you can search for components of characters the same way you can search for components of words. It all but eliminates the need for case folding or what might be called “diacritic folding”. It also puts native Unix technologies (8-bit regexes) back in the forefront. If you want another compelling reason for Bytext, read the section on OBS characters.

The other thing to take note of is the notion of absolute complexity vs. relative complexity. Because of the lack of ASCII transparency, Bytext may arguably be more difficult to implement on a trivial level than UTF-8 on ASCII-based systems (it may have more relative complexity). But consider that many peoples of the world actually want to use their native scripts in protocols and functions. To say that being able to automatically ignore non-ASCII codes is why UTF-8 is better is an affront. Not doing a proper conversion of charsets is clearly a hack, like programming without type safety --not always a bad thing, but certainly not something that should be imposed on everyone. In absolute terms of complexity, Bytext is much simpler than Unicode. All Bytext character properties are modelled by a single table. The bidi algorithm and the line breaking algorithm are vastly simplified. The titlecase property is eliminated while retaining its functionality. East Asian Width properties go from being described in an entire technical report with 6 properties to being equivalently described by a single paragraph and a single property. Consider the complications with the new grapheme joining control character, and the various normative spelling conventions for syllables. Consider the many Unicode technical reports, the 850-page book, the many files of the Unicode database...
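Open composition searching can’t be demonstrated directly without a Bytext implementation, but a rough Python analogy using Unicode NFD decomposition shows the flavor: once characters are decomposed into components, a plain-letter pattern matches accented variants too, which is the “diacritic folding” effect described above.

```python
import re
import unicodedata

text = "café résumé naive"

# Decompose so accents become separate combining code points (U+0300..U+036F).
nfd = unicodedata.normalize("NFD", text)

# Stripping the combining marks folds the diacritics away entirely,
# so a search for the bare letters matches the accented words.
folded = re.sub(r"[\u0300-\u036f]", "", nfd)

assert re.search(r"cafe", folded) is not None
assert re.search(r"resume", folded) is not None
```

The difference claimed for Bytext is that this component structure would live directly in the byte sequence, so a plain 8-bit regex engine would get this behavior without any normalization or mark-stripping pass.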
Consider the tremendous complexity of Unicode regex libraries --still not able to do simple things like automatically search for all variants of a syllable. Truly, it is hard to imagine how Unicode could be made any more complex.

Many of the elegant features of Unixes depend on the notion of 8-bit transparency: pipe, cat, echo... the byte stream is the common denominator. The functions are general purpose and thus more useful. Bytext takes this elegant notion to its logical conclusion: not only can you process text as bytes, you can also process bytes as text. By default, everything is preserved and there are no special sequences to worry about. You can open ANY file as a text file and scan it for troubleshooting information, or just as a way of trying to visually deduce what kind of file it is. It is useful to apply regular expressions and functions like “diff” to arbitrary binary data, not only using the same familiar functions but also within the same familiar application --your text editor.

Bernard

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of David Starner
Sent: Saturday, February 02, 2002 9:59 AM
To: [EMAIL PROTECTED]
Cc: Bernard Miller
Subject: Re: Announcing Bytext

On Sat, Feb 02, 2002 at 02:16:37AM -0800, Bernard Miller wrote:
> Hopefully flags will go off when members of this list read things that are
> equivalent to "I don't understand it but here is my opinion on it...".

I'm a very intelligent person. If I can't understand your document after reading through it a couple of times, then the fault is yours.

> Bytext is a superset of Unicode normalization form C, so it certainly
> encodes all of ASCII including form feed, and all combining characters.

Page two, first paragraph: "There are no surrogate pairs; no combining characters". Page one, last paragraph: "In particular, the role of Bytext is clearly separated from the role of markup.
This is contrasted with many features of Unicode such as interlinear annotation characters (U+FFF9..U+FFFB); the object replacement character (U+FFFC); the nesting bidirectional control characters; and the "Tag Characters" (U+E0000..U+E007F)." Page 37, fourth paragraph: "Unlike Unicode, Bytext does not recognize any "page break" type of control character even as an informative property because page formatting is definitely in the domain of markup. Since FF looks like a PEC in screen display it can produce unanticipated results that are very frustrating if it causes unnecessary pages to print, wasting paper."

(It doesn't say that it doesn't have a character named FORM FEED, but it does say that it doesn't have a page break character - i.e. a form feed.)

> ASCII code points are rearranged partly so that characters like form feed
> can be quickly identified by normalization algorithms. This is far from
> "losing ASCII compatibility".

If you're going to rearrange them, they might as well go away - many were confusing, with better alternatives around anyway. But moving them breaks every program that depended on the binary value of ASCII, which we have more or less guaranteed will stay stable under Unix. Unicode is almost always encoded as UTF-8 under Unix for one reason - because 0x00-0x7f is 0x00-0x7f in ASCII. No surprises.

> Also, there is no need for a new
> primitive data type to support Bytext and certainly one is not "insisted"
> upon, merely recommended.

Sorry, recommended. Page 6, first paragraph: "It is recommended that "uByte", as in "unsigned byte", be the name of the data type that is used to store each byte that is to be interpreted as Bytext. It is also recommended that each uByte be a primitive data type in programming languages that have primitive types."

> It is perfectly reasonable to suppose that Bytext will never catch on, but
> the mere fact that it is in an embryonic stage of development is not PROOF
> that it will never catch on.
The fact that your standard is incoherent and you have no corporate or governmental support is strong evidence that it will never catch on. The fact that it's a new startup in a field of preexisting strong contenders puts the nail in the casket. If you want to do a grassroots project, it has to be clear and attractive, and fill a needed gap. Yet another universal charset standard is not a needed gap.

> I love Linux

As can be noted by the Microsoft Word document and lack of HTML or plain text.

> and do not wish to disturb its developers any further so
> perhaps this thread can continue off list for those interested.

I see no reason to move it, personally; it's not too far off topic for the list. Defend your system in public. If it is to become successful, it will have to be defended in public, with understandings of and answers for the standard arguments.

How about an example? Say, "ᎰᎵ hat Musik gut gehört." What does that look like bytewise in Bytext?

--
David Starner - [EMAIL PROTECTED], dvdeug/jabber.com (Jabber)
Pointless website: http://dvdeug.dhis.org
What we've got is a blue-light special on truth. It's the hottest thing with the youth. -- Information Society, "Peace and Love, Inc."

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
