RE: Announcing Bytext

Bernard Miller Tue, 05 Feb 2002 06:42:43 -0800

Pablo Saratxaga wrote:
> Kaixo!

Cool, Euskara... I have a little Basque blood :-)


> I don't understand, how can you encode in an 8bit space all the characters
> of the world languages ?
>
> And if it is a multi-byte encoding, then it should have about the same
> problems as utf-8 or euc have when faced with byte-only utilities.

It's a variable-length encoding. Basically, the sign bit of each byte is
used to determine character boundaries, and the other 7 bits of each byte
are used to determine the scalar character value. 8-bit regexes can be used
because of the way the scalar values are organized. The whole scalar value
is not always needed, because the characters you would get by reading only
some of the leading bytes of a character are all semantically related. For
example, the first byte of a character may represent "lowercase A". Read
through the second byte, it may represent "uppercase A". Read through the
last byte (what the character actually is), it may represent "uppercase A
with ring above". If you're looking for any old variation on the theme of
"letter A" (including Greek and Cyrillic versions), you just compose a regex
that only matches the first byte of each character against the scalar value
for "lowercase A". If you want to match only a specific multi-byte
character, you search for that character the same way you would search for
a multi-byte word. Despite some complications, it also works with whole
syllables from scripts such as Devanagari.
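
To make the prefix idea above concrete, here is a minimal Python sketch. It
is an illustration only, not Bytext's actual layout: the byte values are made
up, and it assumes that a clear high bit marks the first byte of a character
and a set high bit marks a continuation byte. The names LOWER_A, UPPER_A,
UPPER_A_RING, LOWER_B and find_spans are hypothetical.

```python
import re

# Hypothetical byte assignments, NOT real Bytext code points.
# Assumption: high bit clear = first byte of a character,
#             high bit set   = continuation byte of the same character.
LOWER_A = b"\x61"               # one byte: "lowercase a"
UPPER_A = b"\x61\x81"           # refines "lowercase a" to "uppercase A"
UPPER_A_RING = b"\x61\x81\x85"  # refines further to "A with ring above"
LOWER_B = b"\x62"               # an unrelated character

text = LOWER_B + UPPER_A_RING + LOWER_A + LOWER_B + UPPER_A

# Any variation on "letter A": match the first byte, then consume
# whatever continuation bytes belong to the same character.
any_a = re.compile(rb"\x61[\x80-\xff]*")

# One specific character: match its full byte sequence and require
# that no continuation byte follows, so that "uppercase A" does not
# also match the prefix of "A with ring above".
exact_upper_a = re.compile(re.escape(UPPER_A) + rb"(?![\x80-\xff])")


def find_spans(pattern, data):
    """Return the (start, end) byte offsets of every match."""
    return [m.span() for m in pattern.finditer(data)]


print(find_spans(any_a, text))
print(find_spans(exact_upper_a, text))
```

Under these assumptions the first pattern finds all three "A" characters
regardless of case or diacritic, while the second finds only the plain
"uppercase A" at the end. The negative lookahead is my own addition for the
exact match; the real encoding may make it unnecessary.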

Even without being an international standard, Bytext can be useful as a sort
of normalization form for UCS characters, kind of like an advanced form of
case folding. Anyway, thank you for your interest. I'm putting together a
FAQ, which seems to be desperately needed, at www.bytext.org.

Cheers,

Bernard






--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
