If someone is interested in a nice little project that would be very
helpful for migrating a system to UTF-8, here is an idea:

GNU find

  http://www.gnu.org/software/findutils/findutils.html
  ftp://prep.ai.mit.edu/pub/gnu/findutils/

walks over file system trees, matches the filenames it finds, and
optionally applies operations to them. GNU find seems the perfect
platform for building a tool that helps in changing all filenames on a
system to UTF-8. All that is needed is:

1) Contact the findutils maintainers

2) Add two new tests:

     -nonascii    Filename contains non-ASCII bytes (0x80-0xff)
     -nonutf8     Filename contains malformed or overlong UTF-8 sequences

3) Add one new action:

     -iconvname oldcoding..newcoding

   Pass the filename through iconv with the specified pair of coded
   character sets. False if a file with the name in the new encoding
   exists already; otherwise attempt to rename the file from the old
   encoding to the new encoding. True if both the conversion and the
   rename succeeded.

With these tiny additions, system administrators can convince
themselves with

  find / -nonascii -print

that no filenames at all have to be changed as part of a UTF-8
migration (if nothing is printed, this is guaranteed). They can also
convince themselves with

  find / -nonutf8 -print

that they have most likely completed a UTF-8 filename migration
successfully or, if not, see how much still has to be converted.
The command

  find ~polishman/reports/*.txt -nonascii -iconvname ISO-8859-2..UTF-8

will convert the matching filenames of a user whose filenames are all
known to be encoded in Latin-2.

4) Options for interactively confirming conversion requests might be a
   useful addition.

5) Test whether things work reliably if directory names are converted
   as well.

6) Test whether

     find ~polishman/reports/*.txt -nonascii -iconvname ISO-8859-2..UTF-8 \
       -exec iconv -f ISO-8859-2 -t UTF-8 {} \;

   works as expected (that is, {} is the newly converted filename).

7) Check whether there are any other UTF-8 issues in find that need to
   be addressed.

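
To make the intended semantics of the two tests and of -iconvname a bit
more concrete, here is a rough Python sketch. It is not findutils code,
and none of these options exist in GNU find today. The function names
(is_nonascii, is_nonutf8, iconvname, convert_tree) are made up for
illustration, and Python's codecs stand in for iconv. It assumes a
POSIX system where filenames are plain byte strings. The check for an
existing target followed by the rename is not atomic, so a real
implementation would need to think about races.

```python
import os


def is_nonascii(name: bytes) -> bool:
    """Sketch of the proposed -nonascii test: any byte in 0x80-0xff."""
    return any(b >= 0x80 for b in name)


def is_nonutf8(name: bytes) -> bool:
    """Sketch of the proposed -nonutf8 test.

    Python's strict UTF-8 decoder rejects malformed and overlong
    sequences (and also encoded surrogates).
    """
    try:
        name.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return True
    return False


def iconvname(path: bytes, oldcoding: str, newcoding: str) -> bool:
    """Sketch of the proposed -iconvname action for one path.

    False if the name cannot be converted, if the converted name
    already exists, or if the rename fails; True otherwise.
    """
    dirname, base = os.path.split(path)
    try:
        new_base = base.decode(oldcoding).encode(newcoding)
    except (UnicodeError, LookupError):
        return False
    new_path = os.path.join(dirname, new_base)
    # Also covers the case where the name is unchanged (pure ASCII).
    if os.path.lexists(new_path):
        return False
    try:
        os.rename(path, new_path)
    except OSError:
        return False
    return True


def convert_tree(root: bytes, oldcoding: str, newcoding: str) -> list:
    """Rename every non-ASCII name below root; return the old paths renamed.

    Walking bottom-up means the entries of a directory are handled
    before the directory itself is renamed.
    """
    renamed = []
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            if is_nonascii(name):
                old = os.path.join(dirpath, name)
                if iconvname(old, oldcoding, newcoding):
                    renamed.append(old)
    return renamed
```

Calling convert_tree(b"/home/polishman/reports", "ISO-8859-2", "UTF-8")
would roughly correspond to the Latin-2 example above, except that it
descends into subdirectories instead of taking a shell glob. The
bottom-up walk is just one possible answer to the directory-name
question in item 5.
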
Note that the latest CVS version of glibc now comes with a UTF-8 aware
regular expression library, and find could be the first beneficiary of
that. Most of this functionality could perhaps also be obtained with
tricky use of regular expressions and shell scripts, but it would no
doubt be convenient to have it available in an easy-to-remember syntax
from the standard find command.

Anyone interested?

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/
