On Wed, Feb 01, 2006 at 03:41:18PM -0500, [EMAIL PROTECTED] wrote:
> I don't think that's a problem for a fresh install. Are there any tools
> for converting existing file systems from one encoding to another?
> That's a non-trivial problem. Assuming that all of the characters in
> the source encoding map to distinct characters in the target encoding
> (let's assume for the moment that we're talking about ISO 8859-1 to
> UTF-8), then all of the file names can be converted. But here's the
> list of things that must happen:
I think we can safely assume the destination should always be UTF-8, at
least in the view of people on this list. :) Are there source encodings
that can't be mapped into UCS in a one-to-one manner? I thought
compatibility characters had been added for all of those, even when the
compatibility characters should not be needed.

> 1) All of the file names must be converted from the source encoding to
>    the target encoding.
>
> 2) Any symbolic links must be converted such that they point to the
>    renamed file or directory.

This is easy to automate.

> 3) Files that contain file or directory names will have to be
>    converted. A couple of very obvious examples are /etc/passwd (for
>    home directories) and /etc/fstab (for mount points).

This is even more difficult if users have non-ASCII characters in their
passwords, since you'll need to crack all the passwords first. :)

As for home directories, they should change if and only if usernames
contain non-ASCII characters. It's at least obvious what to do.

fstab? Do people really have mount points with non-ASCII names? I think
it's rare enough that people who do can handle it manually.
Unfortunately, most people don't even know how to separate the basic
unix directory tree into partitions, much less make additional local
structure.

What about the following: for each config directory (/etc,
/usr/local/etc, etc.; pardon the pun), assume all files except a fixed
list (such as ld.so.cache) are text files, and translate them from the
old encoding to UTF-8 accordingly. Make backups, of course. This should
cover all global config.

Per-user config is much more difficult, yes. I would use a system like:

1. Back up all dotfiles from the user's home directory to ~/.old or
   such.
2. Use a heuristic to decide which dotfiles are text; 'file' would
   perhaps work well..?
3. Convert the ones identified as text.

This will require a little patience/tolerance from users, who may need
to fix things manually, but I would expect it to work alright in most
cases.
A much more annoying problem than config will be users' data files,
which are likely to be in the old encoding (HTML, text, ...). For these,
the best thing to do would be to provide users with an easy script (or a
GUI tool, if it's a system with X logins) to convert files.

> It's step 3 that's going to be the problem. While you can make a more
> or less complete list of system files that would have to be converted,
> each case would have to be considered for whether it was safe to
> convert the entire file or it was necessary to just convert file
> names.

I don't see why you would want to convert filenames but leave other data
in the legacy encoding. Can you give examples? The only case I can see
that would be difficult is text strings embedded in binary files.

> There is no way of identifying all of the scripts that might require
> conversion. And I don't want to think about going through each user's
> .bashrc, .profile and .emacs looking for all of the other files they
> load or run.

Any user who manually sources other files from their .profile or .emacs
is savvy enough to convert their own files, I think. :)

BTW, an alternate idea for the whole process may be for the conversion
script to just make a "TODO list" for the sysadmin, listing things it
finds that seem to need conversion, and leaving the actual changes to
the admin (aside from the file renaming).

Rich

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
