Hi,

On Thu, Jun 02, 2005 at 07:16:13PM +0100, Simos Xenitellis wrote:
> Hi All,
> The ZIP format (http://www.info-zip.org/pub/infozip/doc/) appears not
> to specify the text encoding of the filenames of the compressed files,
> which causes a problem with unzip utilities when they try to
> uncompress .ZIP files that include filenames in non-UTF-8 encodings.

AFAIK both the Windows and Mac OS philosophies say that filenames are
sequences of characters, hence the issue above really exists there, as
well as on vfat-like filesystems under Linux, which, if mounted with
iocharset=utf8, force valid UTF-8 strings at the filesystem system call
interface.

However, to the best of my knowledge, POSIX takes a different point of
view: it says filenames are sequences of bytes, with no semantics
attached on how to interpret or display them. I don't know of any
project to force the kernel to keep filenames valid UTF-8 on
ext3/reiser/... filesystems. Hence your question does not arise here:
all you have to do is keep the filename exactly the same byte sequence
as you see it inside the .zip file. So, to silently extract the
contents, you don't need (and shouldn't do) any conversion. (You only
need a conversion if you want to extract verbosely, that is, show the
filenames on the user interface.)

> To solve this problem, a "workaround" is to be able to detect the
> encoding and automagically convert to UTF-8.
>
> Is there a library or sample program that can do such "encoding
> detection" based on short strings of unknown encoding
> (or choose from encodings based on a smaller list than "iconv --list")?

It is fairly easy to detect whether a byte sequence is valid UTF-8 or
not. One possible way is to iconv() it from UTF-8 to UTF-8 (or to
UCS-4) and see whether it succeeds. However, if it fails, I see
absolutely no hope of detecting which 8-bit character set gives that
byte sequence a sane human meaning in one of the languages. Maybe some
heuristics could be done based on dictionary lookups across many
languages,
but I don't think it is worth it. Better to use the locale information,
stripping any .UTF-8 suffix from it.

> It would be good to have something common to solve the problem for at
> least file-roller and ark,
> which are based on graphical interfaces.

I'd recommend extracting the files exactly as they are inside the
archive, and showing the filenames according to the assumed filename
encoding of the UI toolkit being used (e.g. G_BROKEN_FILENAMES or
G_FILENAME_ENCODING for GTK+2). If UTF-8 is assumed for filename
encodings, and a particular filename is not valid UTF-8, IMHO
substituting the invalid UTF-8 sequences with the replacement character
(U+FFFD) is perfectly acceptable.

-- 
Egmont

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
