Hi Pablo,

Today at 23:17, Pablo Saratxaga wrote:

> It is indeed a good feature to do so;
> but the *smallest* unit for which language information is usefull
> are *words*, not characters/letters.

Indeed. But how do you achieve that? It's easiest to have characters
hold language information. Or would you keep language markers
intertwined with the text itself?

>> So, "jota" would still make sense in Spanish, whatever it was
>> pronounced as, but not much sense in English (since it's not a word
>> there). I think this is a good property to know.
>
> No, it is useless. The letter "j", alone, is the same letter on all
> languages using the latin script. There is absolutely no gain in
> creating differences based on language (plus, I know of no language
> where there is a word consisting of the single letter "j").

I disagree: there's a lot to be gained by creating differences based
on language, and I already gave examples of what could be gained.

On the topic of letters, is "Š" the same letter in Croatian, English
and Spanish (they all use Latin script, after all :)?

> "disambiguating" letters depending on the language is a very bad idea,
> beacause it destroys the interexchangeability of documents.

Uhm, how come? Care to elaborate? Why would using another encoding
(a standard one, provided it becomes one) destroy the
interchangeability of documents? It would destroy it as much as using
UTF-16 instead of UTF-8 would, or as much as using Unicode over
ISO-8859-1 would: your software would have to know how to interpret it
and map between them.

> You have problems to do google searchs in Serbian because a text
> can be in two different scripts;

I'm actually more concerned with the display and input problem than
with doing Google searches (I mentioned Google only to show that
people care more about language, yet Google is not able to deduce such
information correctly with the current state of encodings). I want to
type "letters", and display them using any of the scripts simply by
changing a font.
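
To make the "same letters, different display" idea concrete, here is a
minimal sketch of the Serbian Cyrillic to Latin correspondence as a plain
per-letter table. The names `CYR_TO_LAT`, `_with_capitals` and `to_latin`
are made up for this illustration; nothing like this is proposed as an
actual encoding, it only shows that one direction of the mapping is
mechanical.

```python
# Illustrative only: Serbian Cyrillic -> Latin as a per-letter lookup.
# All names here are hypothetical, chosen for this sketch.

CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}


def _with_capitals(table):
    """Extend the lowercase table with capital letters (Љ -> Lj, etc.)."""
    full = dict(table)
    for cyr, lat in table.items():
        full[cyr.upper()] = lat.capitalize()
    return full


# str.maketrans accepts single-character keys mapped to strings of any length.
_TABLE = str.maketrans(_with_capitals(CYR_TO_LAT))


def to_latin(text):
    """Map every Serbian Cyrillic letter to its Latin counterpart."""
    return text.translate(_TABLE)


print(to_latin("Љубав"))
print(to_latin("ћирилица"))
```

The reverse direction is less mechanical with plain Latin text, since a
Latin "nj" may stand for one letter (њ) or for two (н followed by ј).
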
I'm a native Serbian speaker, and most native Serbian speakers tend to
think of the script as a display property (you certainly know that,
since I know you're well clued in about Serbian problems :). If Unicode
fails for my language, how can you claim that it's completely correct?
It simply isn't: it might work "mostly", but that doesn't make it
correct. If your car explodes "only" every 100th time you start it,
would you drive it at all?

> now with your idea of disambiguating letters it means that the same
> problem will exist for almost all languages (minus the very few ones
> using a unique script), it would even be worst, as a same English
> text, for example, could be encoded in dozens of different (eg: in
> English-letters, Spanish-letters, Portuguese-letters,
> French-letters, German-letters, Italian-letters, Indonesian-letters,
> Polish-letters, Irish-letters, Welsh-letters, Danish-letters,...).

Well, it's not up to an encoding to enforce its own correct usage.
After all, one can type text using the "small caps" region of the
Unicode standard, but how often does that happen?

The real issue is estimating how common such misuse would be. I
believe most people would input their text with the correct language
selected, but this cannot be proved either right or wrong without
doing a real-world, large-scale experiment.

After all, it would still be trivial to drop the language data from
characters and to map all data to a glyph repository such as Unicode
or AFII, depending on the application. So even if the problem became
that visible, no search engine would need to have problems with it
(and it could still make use of language properties when the user
explicitly asks for them).

>> We must agree that these differences Unicode went after are
>> glyph-based, rather than character-based.
>
> They are character based.
> With a character defined as an atomic element of a script (there are of
> course a lot of exceptions due to historical reasons, but that is the
> basic idea).

Ok, read my "character" as "letter", if you use this definition of a
character. So yes, Unicode is a collection of script symbols, which
you call characters, and I call glyphs :)

But that was not the intention of Unicode. FWIW, the Unicode
definition of a character[1] would allow (even prefer) my
interpretation as well:

> Character. (1) The smallest component of written language that has
> semantic value; refers to the abstract meaning and/or shape, rather
> than a specific shape (see also glyph), though in code tables some
> form of visual representation is essential for the reader's
> understanding. (2) Synonym for abstract character. (3) The basic
> unit of encoding for the Unicode character encoding. (4) The English
> name for the ideographic written elements of Chinese origin. (See
> ideograph (2).)

There's no mention of script here (there is of "language"), and I'd
certainly consider "smallest component of written language that has
semantic value" a letter. See also [2], where it is clearly pointed
out that letter is closely tied to character (i.e. character is the
encompassing/superset concept, not a different concept as you try to
put it).

Your view of "character" more closely resembles "grapheme" according
to [3] (and (2) in there explicitly states that users commonly think
of a grapheme as a character, but that's not what Unicode considers it
to be).

[1] http://www.unicode.org/glossary/#character
[2] http://www.unicode.org/glossary/#letter
[3] http://www.unicode.org/glossary/#grapheme

> So, unicode is a collection of *scripts*, each script is separate and
> independent of the others, and each script is a collection of characters

But that's completely false, and you know it: scripts are not
independent, except somewhat in their graphic/display properties!
Scripts commonly have mappings between them, depending on the
_language_ of use! Think of Pin-Yin as a relation between otherwise
unrelated scripts. Many languages are multi-script.
Are you saying that digraphs (Ç, Ç, Ç, Ç, Ç, Ç) are completely
independent of the Serbian Cyrillic script?

> belonging to that script (there are some special characters, like
> generic puntuation and ascii digits, that can be used in conjunction
> with most scripts, but outside the shared puntuation characters, the
> different characters are exclusive to a given script, even if there are
> similarities in some cases with other characters of another script).
> The basic concept to encode writing is the script, that is so when
> electronically encoding text simply because that is so when writting
> text by hand or press.

But script also depends on the language, and that's my entire point.
You can claim that's not so as much as you wish, but there are many
differences between Serbian Cyrillic and Russian Cyrillic: they even
use completely different glyphs ("Ð" and "Ñ") for arguably the same
sound.

Writing text by hand or with press matrices is not really a good
example: many Cyrillic or Greek characters (e.g. uppercase Greek in
TeX) can be obtained using Latin forms (i.e. this is more an example
of glyph usage, not of characters). I.e. it proves nothing, except
that it doesn't prove anything :)

>> I say that "a" and "Ð" are same characters in Serbian,
>
> They are not.
> They may be the same *letter* in Serbian.
> But a letter is not a character (in Spanish, "ch" is a letter (yes, I'm
> a traditionalist), as well in Serbian "lj" and "nj" are letters;
> however the involved characters are "c", "h", "l", "n", "j".
> Note also how in cyrillic script "Ñ" and "Ñ" are single caracters,
> note also that "ÐÑ" and "ÐÑ" are not single characters.

Ok, we've used different definitions of a "character". If I accept
your definition, then you're correct. What I meant is, of course, that
they're the same letters (using your definition).
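
The letter/character split above can be checked against the Unicode
character database. The following is a small sketch using Python's
standard `unicodedata` module; the variable names are only for
illustration. It shows that Unicode has both a single-code-point Latin
"lj" (U+01C9) and the two-character sequence "l" + "j", and that only
compatibility normalization (NFKC) maps one to the other.

```python
import unicodedata

single = "\u01C9"    # the digraph encoded as one code point
pair = "lj"          # the same letter typed as two characters
cyrillic = "\u0459"  # the corresponding Serbian Cyrillic letter

# One letter for a Serbian reader, but one or two code points in Unicode.
print(len(single), len(pair), len(cyrillic))

# The character names recorded in the Unicode database.
print(unicodedata.name(single))
print(unicodedata.name(cyrillic))

# Canonical normalization keeps the digraph code point as it is;
# compatibility normalization splits it into "l" + "j".
print(unicodedata.normalize("NFC", single) == single)
print(unicodedata.normalize("NFKC", single) == pair)
```
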

> You wrongly see latin and cyrillic variants of Serbian as simple
> differences in shape of the same characters; that is not so,
> you should instead look at it as two orthographic variants.

If characters are defined as script elements, then sure (after all,
I'm not so dumb as to claim that something defined as a script element
is independent of the script). I was clearly talking about characters
as letters, or elements used to write down a language.

If, OTOH, characters are defined as "the smallest component of written
language that has semantic value; refers to the abstract meaning
and/or shape, rather than a specific shape" (from the Unicode
Glossary, cited above), then I'm not wrong at all: "Ð"/"a" are both
smallest components of written Serbian that have the same semantic
value, and refer to the same abstract meaning, but not the same shape
(ok, they're coincidentally the same shapes as well; I could have used
Ð/d instead). I.e. they're one and the same character.

You're trying to pull a trick on me with your word usage :) But this
only points out all the problems with Unicode (a policy such as "no
more precomposed glyphs" means that some characters will not be
encoded, but that the glyphs of those characters would be attainable
through composing mechanisms).

So, Unicode is a glyph repository, no matter what tricks you try to
pull :)

Finally, I'm not saying Unicode is fundamentally bad, only that there
could be something even better for encoding textual data.

Cheers,
Danilo
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
