Pablo Saratxaga wrote:
> Kaixo!

Cool, Euskara. I have a little Basque blood :-)

> I don't understand, how can you encode in an 8bit space all the characters
> of the world languages ?
>
> And if it is a multi-byte encoding, then it should have about the same
> problems as utf-8 or euc have when faced with byte-only utilities.

It's a variable-length encoding. Basically, the sign bit of each byte is
used to determine character boundaries, and the other 7 bits of each byte
are used to determine the scalar character value.

8-bit regexes can be used because of the way the scalar values are
organized. The whole scalar value is not always needed: if you stop before
reading all the bytes of a character, the bytes read so far still represent
a character, and those shorter characters are all semantically related to
the full one. For example, the first byte of a character may represent
"lowercase A". The first two bytes may represent "uppercase A". All the
bytes through the last one (what the character actually is) may represent
"uppercase A with ring above".

If you're looking for any old variation on the theme of "letter A"
(including Greek and Cyrillic versions), you just compose a regex that
matches only the first byte of each character, using the scalar value for
"lowercase A". If you want to match only a specific multi-byte character,
you search for that character the same way you would search for a
multi-byte word. Despite some complications, it also works with whole
syllables from scripts such as Devanagari.

Even without being an international standard, Bytext can be useful as a
sort of normalization form of UCS characters, kind of like an advanced form
of case folding.

Anyway, thank you for your interest. I'm putting together a FAQ at
www.bytext.org, which seems to be desperately needed.

Cheers,
Bernard

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
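
The boundary and prefix-matching idea described in the message above can be
sketched in a few lines of Python. This is only an illustration under stated
assumptions, not Bytext itself: the message does not say which polarity of the
sign bit marks a boundary, so the sketch assumes a clear high bit starts a
character and a set high bit continues it. The byte values and the helper names
(split_chars, find_family, find_exact) are invented for the example.

```python
import re

# Hypothetical byte values, invented for illustration; NOT real Bytext codes.
# Assumption: high bit clear = first byte of a character,
#             high bit set   = continuation byte of the same character.
LOWER_A = b"\x61"               # "lowercase a"
UPPER_A = b"\x61\x81"           # same first byte, refined to "uppercase A"
UPPER_A_RING = b"\x61\x81\x85"  # refined again to "uppercase A with ring above"
LOWER_B = b"\x62"


def split_chars(data):
    """Split a byte string into characters using only the high bit."""
    return re.findall(rb"[\x00-\x7f][\x80-\xff]*", data)


def find_family(data, first_byte):
    """Offsets of every character that starts with first_byte.

    Under the assumed scheme a first byte can never appear as a
    continuation byte, so a plain 8-bit search is enough.
    """
    return [m.start() for m in re.finditer(re.escape(first_byte), data)]


def find_exact(data, char):
    """Offsets of exactly this character, not a longer one it prefixes."""
    pattern = re.escape(char) + rb"(?![\x80-\xff])"
    return [m.start() for m in re.finditer(pattern, data)]


text = LOWER_B + UPPER_A_RING + LOWER_A + UPPER_A
```

With this sample text, find_family(text, LOWER_A) returns the offsets of all
three "letter A" variants, while find_exact(text, UPPER_A) returns only the
offset of the plain uppercase A, because the negative lookahead rejects the
longer character that merely starts with the same bytes.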
