Hi,

On Thu, Jun 02, 2005 at 07:16:13PM +0100, Simos Xenitellis wrote:
> Hi All,
> The ZIP format (http://www.info-zip.org/pub/infozip/doc/) appears not
> to specify the text encoding of the filenames of the compressed files,
> which causes a problem with unzip utilities when they try to
> uncompress .ZIP files that include filenames in non-UTF-8 encodings.

AFAIK both the Windows and Mac OS philosophies say that filenames are
sequences of characters, hence the issue above really exists there, as
well as on vfat-like filesystems under Linux, which, if mounted with
iocharset=utf8, force valid UTF-8 strings at the filesystem system call
interface.

However, to the best of my knowledge, POSIX takes a different point of
view: it says filenames are sequences of bytes, with no semantics
attached on how to interpret or display them. I don't know of any
project to force the kernel to keep filenames valid UTF-8 on
ext3/reiser/... filesystems. Hence your question does not arise here:
all you have to do is keep the filename exactly the same byte sequence
as you see it inside the .zip file. So, to silently extract the
contents, you don't need (and shouldn't do) any conversion. (You only
need a conversion if you want to extract verbosely, that is, show the
filenames on the user interface.)

> To solve this problem, a "workaround" is to be able to detect the
> encoding and automagically convert to UTF-8.
>
> Is there a library or sample program that can do such "encoding
> detection" based on short strings of unknown encoding
> (or choose from encodings based on a smaller list than "iconv --list")?

It is fairly easy to detect whether a byte sequence is valid UTF-8 or
not. One possible way is to iconv() it from UTF-8 to UTF-8 (or to
UCS-4) and see whether it succeeds. However, if it fails, I see
absolutely no hope of detecting which 8-bit character set gives that
byte sequence a sane human meaning in one of the languages. Maybe some
heuristics could be done based on dictionary lookups across many
languages,
but I don't think it is worth it. Better to use the locale information,
stripping any .UTF-8 suffix from it.

> It would be good to have something common to solve the problem for at
> least file-roller and ark,
> which are based on graphical interfaces.

I'd recommend extracting the files exactly as they are inside the
archive, and showing the filenames according to the assumed filename
encoding of the UI toolkit being used (e.g. G_BROKEN_FILENAMES or
G_FILENAME_ENCODING for GTK+2). If UTF-8 is assumed for filename
encodings, and a particular filename is not valid UTF-8, IMHO
substituting the invalid UTF-8 sequences with the replacement character
(U+FFFD) is perfectly acceptable.

-- 
Egmont

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
