I don't mean to be rude, and this discussion has plenty of value, but perhaps ogb-discuss is not the right place for it.

On Tue, May 13, 2008 at 8:18 AM, Don Cragun <don.cragun at sun.com> wrote:
> >Date: Tue, 13 May 2008 16:54:16 +0200
> >From: Roland Mainz <roland.mainz at nrubsig.org>
> >
> >Joerg Schilling wrote:
> >> Don Cragun <don.cragun at sun.com> wrote:
> >> > >BTW: Regarding our talk... I checked the POSIX standard and it
> >> > >turns out that od(1) support for UTF-8 "chars" is fully
> >> > >optional. There is no need to support it.
> >> > >
> >> > >Jörg
> >> >
> >> > Joerg,
> >> > This is only partly true.
> >>
> >> Please also comment on Roland's claim that UNICODE is not a
> >> lossless coding. Roland mentioned this recently without giving
> >> evidence.
>
> Joerg,
> In addition to the comments Roland made below, there are also a
> lot of "private" character sets that contain characters (e.g., the AT&T
> deathstar logo, the Sun logo, etc.) that do not appear in any ISO
> standard character set. Also, just as new English words are created
> every year, new ideographs appear in the languages that use ideographic
> character sets. These ideographs may be used for a long time before
> they are included in a UNICODE revision (and when the new ideographs
> represent children's names, they may never be included).
>
> - Don
>
> >There wasn't enough time during our meeting to show the problem in
> >detail...
> >
> >> I can hardly believe that the 21 bit coding used by UNICODE still
> >> has problems to map other codings. UNICODE has been designed to be
> >> a lossless coding....
> >
> >... I try to keep it short: Some encodings (e.g. ISO-2022) can define
> >the language being used in the following characters (similar to the
> >xml:lang="<lang>" tag in XML). Since Unicode folds some characters which
> >are shared between languages to one codepoint (search for
> >"han-unification") this information is lost[1], making Unicode not 100%
> >lossless. Sounds trivial but it results in some unhappy&&nasty issues
> >when the users mix text from multiple languages (one of the "harmless"
> >things is that browsers will choose fonts based on the language being
> >used - which may lead to issues like a Japanese font being used for a
> >single lonely character in the middle of an otherwise completely Chinese
> >text... and backwards... (and if you've followed the history of both
> >countries in the last >= 1500 years you may realise that they don't like
> >that much...)), unfortunately for languages where the matching countries
> >are hyper-picky about their characters (note: That's an understatement).
> >
> >[1]=Technically there are language-selector characters in a block
> >outside the BMP (= Basic Multilingual Plane) but I'm not sure whether
> >they are really intended for this use - at least the existing converters
> >do not use them and I can't find a standard (or even draft) which
> >defines their usage. Or short: The situation is stuck badly in the mud.
> >
> >If you want the long story ask in i18n-discuss@, AFAIK Ienup can explain
> >all the details better than I can do...
> >
> >----
> >
> >Bye,
> >Roland
>
> _______________________________________________
> ogb-discuss mailing list
> ogb-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/ogb-discuss

-- 
PGP Public Key 0x437AF1A1 Available on hkp://pgp.mit.edu
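
For anyone who wants to see the han-unification point from the quoted text in practice, here is a minimal Python sketch. It is only an illustration under assumptions that are not from the thread: the sample ideograph (U+76F4) and the two legacy encodings (ISO-2022-JP and GB2312) are my own picks, and the ISO-2022-JP escape sequence strictly selects a character set, which only implies the language. The last lines look up U+E0001, one of the tag characters outside the BMP that footnote [1] appears to refer to.

```python
import unicodedata

# Assumption: U+76F4 is just a convenient sample ideograph that exists in
# both the Japanese (JIS X 0208) and the Chinese (GB2312) character sets.
ch = "\u76f4"

# The two legacy byte streams differ, so a reader of the raw bytes can
# tell which national character set the text came from.
jp_bytes = ch.encode("iso2022_jp")
cn_bytes = ch.encode("gb2312")

# After decoding, both are the same single code point; the hint about the
# source character set (and hence the likely language) is gone.
from_jp = jp_bytes.decode("iso2022_jp")
from_cn = cn_bytes.decode("gb2312")
same_codepoint = (from_jp == from_cn)

# U+E0001 is a tag character in plane 14, i.e. outside the BMP.
lang_tag = "\U000E0001"
lang_tag_name = unicodedata.name(lang_tag)
outside_bmp = ord(lang_tag) > 0xFFFF

print(jp_bytes != cn_bytes, same_codepoint, lang_tag_name, outside_bmp)
```

A font chooser that only sees the decoded string has nothing left to distinguish the two sources, which is the browser problem Roland describes.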
