Project idea: UTF-8 migration support via findutils

Markus Kuhn Mon, 05 Feb 2001 04:39:43 -0800
If someone is interested in looking into a nice little project that will
be very helpful to facilitate migrating a system to UTF-8, here is an
idea:

GNU find 

  http://www.gnu.org/software/findutils/findutils.html
  ftp://prep.ai.mit.edu/pub/gnu/findutils/

walks over file system trees and matches the filenames it finds and
optionally applies operations to them.

GNU find seems the perfect platform for building a tool that helps in
changing all filenames on a system to UTF-8. All that is needed is:

1) Contact the findutils maintainers

2) Add two new tests:

-nonascii
   Filename contains non-ASCII bytes (0x80-0xff)

-nonutf8
   Filename contains malformed or overlong UTF-8 sequences

3) Add one new action:

-iconvname oldcoding..newcoding
   Pass the filename through iconv, converting it from oldcoding to
   newcoding. False if a file with the converted name exists already,
   otherwise attempt to rename the file from the old name to the
   converted name. True if both conversion and rename succeeded.

With these tiny additions, system administrators can convince themselves
with

  find / -nonascii -print

that it is guaranteed that no filenames have to be changed as part of a
UTF-8 migration (if nothing is printed, every name is pure ASCII and
therefore already valid UTF-8). They can also convince themselves with

  find / -nonutf8 -print

that most likely they have completed a UTF-8 filename migration
successfully, or if not, how much still has to be converted.
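To make the two proposed tests concrete, here is a minimal sketch in
Python of what they would check. This is only an illustration, not
findutils code: the function names is_nonascii and is_nonutf8 are made
up for this example, and the sketch assumes that Python's strict UTF-8
decoder (which rejects malformed and overlong sequences) is an
acceptable stand-in for the check that -nonutf8 would perform.

```python
def is_nonascii(name: bytes) -> bool:
    """Hypothetical equivalent of -nonascii: any byte in 0x80-0xff."""
    return any(byte >= 0x80 for byte in name)


def is_nonutf8(name: bytes) -> bool:
    """Hypothetical equivalent of -nonutf8: name is not well-formed UTF-8."""
    try:
        # The strict decoder raises on malformed and overlong sequences.
        name.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return True
    return False
```

Note that a pure ASCII name passes both checks, whereas a name that
merely happens to decode as UTF-8 is not thereby proven to have been
converted, which is why the second check can only say "most likely".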

The command

  find ~polishman/reports/*.txt -nonascii -iconvname ISO-8859-2..UTF-8

will convert the matching filenames of a user, provided it is known that
all his filenames are encoded in Latin-2 (ISO 8859-2).
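The intended behaviour of the proposed action can be sketched in a few
lines of Python. Again this is only an assumption about how such an
action might behave, not an existing find feature: the function name
iconv_name is hypothetical, Python's codecs stand in for iconv, and the
return value mirrors the true/false result described for -iconvname.

```python
import os


def iconv_name(path: bytes, old: str, new: str) -> bool:
    """Hypothetical -iconvname: rename path's last component old -> new."""
    directory, name = os.path.split(path)
    try:
        new_name = name.decode(old).encode(new)
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False  # the name could not be converted
    target = os.path.join(directory, new_name)
    # False if the name in the new encoding exists already. This also
    # covers the case where the conversion leaves the name unchanged.
    if os.path.lexists(target):
        return False
    try:
        os.rename(path, target)
    except OSError:
        return False
    return True
```

The existence check and the rename are two separate steps here, so
another process could create the target in between; a real
implementation would have to decide how to handle that.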

4) Options for interactively confirming conversion requests might be a
useful addition.

5) Test whether things work reliably if directory names are converted as well.

6) Test whether

  find ~polishman/reports/*.txt -nonascii -iconvname ISO-8859-2..UTF-8 \
       -exec iconv -f ISO-8859-2 -t UTF-8 {} \;

works as expected (that is, {} is the newly converted filename). 

7) Check whether there are any other UTF-8 issues in find that need to be
addressed. Note that the latest CVS version of glibc now comes with a
UTF-8 aware regular expression library and find could be the first
beneficiary of that.


Most of this functionality could perhaps also be obtained with
tricky use of regular expressions and shell scripts, but it would no
doubt be convenient to have it available in an easy-to-remember syntax
from the standard find command.
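For comparison, the script route might look roughly like the following
Python sketch, which also touches on point 5 (directory names). It is
an illustration under stated assumptions rather than a tested migration
tool: the function name convert_tree is hypothetical, Python's codecs
stand in for iconv, and the tree is walked bottom-up so that the
contents of a directory are handled before the directory itself is
renamed.

```python
import os


def convert_tree(root: bytes, old: str, new: str) -> list:
    """Rename all non-ASCII names below root; return the paths skipped."""
    skipped = []
    # topdown=False visits a directory's contents before the directory
    # itself shows up in its parent's list, so renaming it is safe.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            if all(byte < 0x80 for byte in name):
                continue  # pure ASCII, nothing to convert
            source = os.path.join(dirpath, name)
            try:
                target = os.path.join(dirpath, name.decode(old).encode(new))
            except (UnicodeDecodeError, UnicodeEncodeError):
                skipped.append(source)
                continue
            if os.path.lexists(target):
                skipped.append(source)  # converted name exists already
                continue
            try:
                os.rename(source, target)
            except OSError:
                skipped.append(source)
    return skipped
```

Like the find command shown earlier, this converts every non-ASCII name
it meets, so running it twice over the same tree would convert names
that are already UTF-8 a second time. The root directory itself is
never renamed.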

Anyone interested?

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/