Ienup Sung wrote:
> Yes, we have numerous locales with different codesets. Solaris 10,
> as an example, has 165 locales with 23 different codesets.
> In many cases, codesets use quite similar representation forms, and yet
> the mappings between code point values and actual characters/glyphs
> are quite different.
>
> Underlying file systems also have various ways of storing characters,
> although many new file systems are converging on Unicode. (Even then,
> among those newer file systems that use Unicode, some use different
> Unicode encodings that are not byte-for-byte compatible with one
> another.)
>
> To solve the problem of incorrectly displayed non-ASCII characters
> while keeping maximum compatibility with existing applications and
> with our numerous locales and codesets, it appears we must either tag
> each file with its codeset, or adopt Unicode -- in particular, UTF-8 --
> as the file system codeset and add transparent codeset conversion on
> top. The two approaches could also be supported together or separately.
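[Editorial note: a minimal sketch of the "transparent conversion" idea from the quote above. The function name and codeset choices (EUC-JP vs. UTF-8) are hypothetical illustrations, not anything Solaris actually implements; real filename conversion would happen at the VFS or libc layer, not in Python.]

```python
def convert_filename(raw: bytes, fs_codeset: str, locale_codeset: str) -> bytes:
    """Re-encode a filename's raw bytes from the file system's codeset
    into the codeset of the caller's locale (hypothetical helper)."""
    return raw.decode(fs_codeset).encode(locale_codeset)

# A Japanese filename stored on disk as EUC-JP, presented to a
# process running in a UTF-8 locale:
raw = "日本語.txt".encode("euc_jp")
converted = convert_filename(raw, "euc_jp", "utf-8")
assert converted == "日本語.txt".encode("utf-8")
```

Note that this only works when the file system's codeset is known, which is exactly why the quote poses the choice as "tag each file's codeset" versus "standardize on UTF-8": without one of the two, there is nothing reliable to convert from.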
... The various LC_* variables can point to different locales+encodings
(for example ja_JP.UTF-8 vs. ja_JP.PCK) ... isn't there some risk that
transparent translations somehow cause havoc? I assume not (assuming
that only filenames are converted transparently), but has anyone thought
this detail (different LC_* variables pointing to different
locales/encodings) through to the end? (I can't think in a straight line
anymore after ~48h of brain uptime, please excuse me if I start asking
silly questions...)

Another (likely more real-world) problem is: how would such a
"transparent conversion" handle characters which cannot be represented
in the current locale? For example, how should the "C"/"POSIX" locale
handle German umlauts (e.g. "öäü")? Just replace them with '?', use
transliteration (e.g. 'ü' == "ue", 'ö' == "oe", etc.), encode them
URL-encoding style (another, perhaps more portable, kind of
transliteration), or invent some all-new solution for the problem?

Final thought: I guess if Solaris wants to use Unicode (locales) more
widely, some of the tools in /usr/bin/ need to be replaced with the
versions from /usr/xpg4/bin/, since the /usr/bin/ tools suffer from the
widespread "I don't care about multibyte locales" disease (the question
is whether this is considered a "bug" or a "feature"... ;-/ ). How
should this be handled if the "bugfix" (e.g. handling multibyte
characters correctly) collides with something like backwards
compatibility?

----

Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) [EMAIL PROTECTED]
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 7950090
 (;O/ \/ \O;)
_______________________________________________
opensolaris-discuss mailing list
opensolaris-discuss@opensolaris.org
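[Editorial note: the three fallback strategies Roland lists for characters unrepresentable in the current locale -- '?' replacement, transliteration, and URL-style percent-encoding -- can be sketched as follows. The function names and the small umlaut table are hypothetical; a real implementation would sit behind something like iconv(3), where '?' replacement and transliteration correspond to plain conversion vs. the //TRANSLIT suffix, whose output is locale- and implementation-dependent.]

```python
import urllib.parse

# Hypothetical German-style transliteration table ('ü' -> "ue", etc.)
UMLAUT_MAP = str.maketrans({"ö": "oe", "ä": "ae", "ü": "ue", "ß": "ss"})

def to_ascii_replace(name: str) -> str:
    # Lossy: every unrepresentable character collapses to '?'
    return name.encode("ascii", errors="replace").decode("ascii")

def to_ascii_translit(name: str) -> str:
    # Readable but still lossy (and not reversible: "oe" is ambiguous)
    return name.translate(UMLAUT_MAP)

def to_ascii_urlencode(name: str) -> str:
    # Reversible: percent-encode the UTF-8 bytes, URL style
    return urllib.parse.quote(name)

print(to_ascii_replace("öäü"))    # ???
print(to_ascii_translit("öäü"))   # oeaeue
print(to_ascii_urlencode("öäü"))  # %C3%B6%C3%A4%C3%BC
```

The sketch also makes Roland's trade-off concrete: only the percent-encoded form can be mapped back to the original name, at the cost of being far less readable than transliteration.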