I don't mean to be rude; this discussion has plenty of value, but
perhaps ogb-discuss is not the right place for it.

On Tue, May 13, 2008 at 8:18 AM, Don Cragun <don.cragun at sun.com> wrote:
>  >Date: Tue, 13 May 2008 16:54:16 +0200
>  >From: Roland Mainz <roland.mainz at nrubsig.org>
>  >
>  >Joerg Schilling wrote:
>  >> Don Cragun <don.cragun at sun.com> wrote:
>  >> > >BTW: Regarding our talk... I checked the POSIX standard and it turns out
>  >> > >that od(1) support for UTF-8 "chars" is fully optional. There is no need to
>  >> > >support it.
>  >> >
>  >> > >Jörg
>  >> >
>  >> > Joerg,
>  >> >       This is only partly true.
>  >>
>  >> Please also comment on Roland's claim that UNICODE is not a lossless coding.
>  >> Roland mentioned this recently without giving evidence.
>
>  Joerg,
>         In addition to the comments Roland made below, there are also a
>  lot of "private" character sets that contain characters (e.g., the AT&T
>  deathstar logo, the Sun logo, etc.) that do not appear in any ISO
>  standard character set.  Also, just as new English words are created
>  every year, new ideographs appear in the languages that use ideographic
>  character sets.  These ideographs may be used for a long time before
>  they are included in a UNICODE revision (and when the new ideographs
>  represent children's names, they may never be included).
>
>         - Don
>
>
>
>  >
>  >There wasn't enough time during our meeting to show the problem in
>  >detail...
>  >
>  >> I can hardly believe that the 21-bit coding used by UNICODE still has problems
>  >> mapping other codings. UNICODE has been designed to be a lossless coding....
>  >
>  >... I'll try to keep it short: Some encodings (e.g. ISO-2022) can declare
>  >the language used by the characters that follow (similar to the
>  >xml:lang="<lang>" attribute in XML). Since Unicode folds some characters which
>  >are shared between languages into one codepoint (search for
>  >"han-unification"), this information is lost[1], making Unicode not 100%
>  >lossless. Sounds trivial, but it results in some unhappy&&nasty issues
>  >when users mix text from multiple languages. One of the "harmless"
>  >things is that browsers choose fonts based on the language being
>  >used, which may lead to issues like a Japanese font being used for a
>  >single lonely character in the middle of an otherwise completely Chinese
>  >text... and vice versa... (and if you've followed the history of both
>  >countries in the last >= 1500 years you may realise that they don't like
>  >that much...). Unfortunately, this hits languages where the matching countries
>  >are hyper-picky about their characters (note: That's an understatement).
>  >
>  >[1]=Technically there are language-selector characters in a block
>  >outside the BMP (= Basic Multilingual Plane) but I'm not sure whether
>  >they are really intended for this use - at least the existing converters
>  >do not use them and I can't find a standard (or even draft) which
>  >defines their usage. In short: The situation is stuck badly in the mud.
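
To make the folding described above concrete, here is a minimal Python
sketch. It is an illustration only, not anything from the thread: the
sample character (U+76F4), the two codecs (euc_jp and gb2312), and the
helper names round_trip() and language_tag() are all assumptions chosen
for the example. It shows the same Han character arriving from a Japanese
and a Chinese legacy encoding and ending up as one indistinguishable code
point, and then how the Plane 14 tag characters from footnote [1] could in
principle carry the language alongside the text.

```python
# Sketch only: the codecs and the sample character are illustrative choices.
SAMPLE = "\u76f4"  # a Han character present in both JIS X 0208 and GB2312


def round_trip(encoding, text=SAMPLE):
    """Encode text in a legacy encoding, then decode it back to Unicode."""
    raw = text.encode(encoding)
    return raw, raw.decode(encoding)


jp_bytes, jp_text = round_trip("euc_jp")  # Japanese legacy encoding
cn_bytes, cn_text = round_trip("gb2312")  # Chinese legacy encoding

# The legacy byte sequences differ, but both decode to the same code point,
# so nothing in the Unicode string records which language it came from.
same_code_point = jp_text == cn_text
different_bytes = jp_bytes != cn_bytes


def language_tag(lang):
    """Hypothetical helper: spell a language code with Plane 14 tag characters."""
    # U+E0001 is LANGUAGE TAG; U+E0020..U+E007E mirror printable ASCII.
    return "\U000E0001" + "".join(chr(0xE0000 + ord(c)) for c in lang)


# Prefixing the text with a tag is one conceivable way to keep the language.
tagged = language_tag("ja") + jp_text

print(same_code_point, different_bytes)
print([f"U+{ord(c):04X}" for c in tagged])
```

Whether any converter would emit or honour such a tag is exactly the open
question raised in the footnote; the sketch only shows that the code
points exist, not that anything uses them.
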
>  >
>  >If you want the long story, ask in i18n-discuss@; AFAIK Ienup can explain
>  >all the details better than I can...
>  >
>  >----
>  >
>  >Bye,
>  >Roland
>
>
>
> _______________________________________________
>  ogb-discuss mailing list
>  ogb-discuss at opensolaris.org
>  http://mail.opensolaris.org/mailman/listinfo/ogb-discuss
>



-- 
PGP Public Key 0x437AF1A1
Available on hkp://pgp.mit.edu
