On Mon, Nov 12, 2012 at 9:44 AM, Ma Xiaojun <[email protected]> wrote:
> Bug Hint (not reported by me):
> https://bugzilla.gnome.org/show_bug.cgi?id=648673
>
> There are basically two kinds of ZIP archive. Those with random file
> name encoding (not Unicode enabled) and those with UTF-8 file name
> encoding and proper meta data set (Unicode enabled).
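
To make the "two kinds" distinction concrete, here is a minimal sketch of how one might tell them apart. It assumes that "proper meta data" means general purpose bit 11 (0x800) of the ZIP header, which marks a name as UTF-8; the Unicode Path extra field that Info-ZIP also supports is not checked here. The helper name utf8_flagged_names is made up for illustration.

```python
import io
import zipfile

# General purpose bit 11 of the ZIP header: the file name is UTF-8.
UTF8_NAME_FLAG = 0x800


def utf8_flagged_names(data: bytes) -> dict:
    """Map each member name to True if its UTF-8 name flag is set.

    Hypothetical helper: it only looks at the flag bit, not at the
    Unicode Path extra field.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return {
            info.filename: bool(info.flag_bits & UTF8_NAME_FLAG)
            for info in archive.infolist()
        }
```

Members without the flag are the ones whose names would need a guessed legacy encoding.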

Also see
https://bugzilla.gnome.org/show_bug.cgi?id=306403
https://bugzilla.redhat.com/show_bug.cgi?id=225576

> UnZip 6.0 (the current latest released version) from Info-ZIP can
> extract Unicode enabled archives correctly. However, its listing
> feature would treat any non-ASCII character in a file name as '?', even
> for Unicode enabled archives. This affects File Roller also, so we have
> the above mentioned bug.
>
> Fortunately, UnZip has a -U option. When dealing with Unicode enabled
> archives, it will escape non-ASCII characters to #UXXXX or #LYYYYYY. I
> already made a working patch for File Roller to utilize this.
> https://gist.github.com/4057999
>
> Unfortunately, #UXXXX or #LYYYYYY are also legitimate file names in
> ZIP archives and UnZip's -U option doesn't escape literal # currently.
> I'm trying to contact the upstream already.
> http://www.info-zip.org/phpBB3/viewtopic.php?f=4&t=405
>
> On the File Roller side, we may list the archive twice, once without -U
> and once with -U. Then we can determine which # is literal and which #
> is for escaping. There is another annoying detail worth noting here:
> vanilla UnZip shows exactly one ? for one Unicode character, while
> patched UnZip (found in at least Arch and Ubuntu) shows several ? for
> one Unicode character (the number of ? equals the number of UTF-8
> bytes).
>
> What do you think?

I think that the wider issue is how to deal with legacy (= non-UTF-8)
encodings. This covers not only filenames inside ZIP archives, but also
text files in a legacy encoding (such as subtitles), ID3 tags and so on.

There have been some proposals to guess the legacy encoding (using
frequencies of letters, etc.); however, they add complexity.

AFAIK, if gtk/glib finds invalid UTF-8 in text, it tries to convert
from iso-8859-1 to UTF-8. What I believe should happen is for gtk/glib
to get a hint from the operating system locale (i.e.
a variable GTK_LEGACY_ENCODING), and autoconvert any invalid text from
GTK_LEGACY_ENCODING to UTF-8.

In your case, you deal with ZIP archives that may have been created
with a localised version of Windows, so the filenames may be in a
legacy encoding.

Thus, my easy recommendation:
File-roller assumes that all ZIP files contain UTF-8 encoded filenames.
When it detects that a filename is not valid UTF-8, it tries to convert
it from a legacy encoding to UTF-8. File-roller can guess the legacy
encoding from the system locale, or it can show the user a dialog box
with the best guess and let the user change the encoding on the fly
until the filenames in the textbox make sense.

Simos
_______________________________________________
desktop-devel-list mailing list
[email protected]
https://mail.gnome.org/mailman/listinfo/desktop-devel-list
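
The fallback recommended above (try UTF-8 first, then a legacy encoding) could look roughly like the sketch below. It is written in Python for illustration only and is not File Roller code. The function name decode_zip_name and its legacy_encoding parameter are hypothetical; the parameter stands in for whatever the locale guess or the user's choice in the dialog box supplies. The final iso-8859-1 step mirrors the gtk/glib behaviour mentioned above and is likewise an assumption.

```python
import locale
from typing import Optional


def decode_zip_name(raw: bytes, legacy_encoding: Optional[str] = None) -> str:
    """Decode a raw archive member name to text.

    Hypothetical helper: 'legacy_encoding' is the guess (or the user's
    pick); when it is None the locale's preferred encoding is used.
    """
    # 1. Assume UTF-8, as recommended for all ZIP files.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    # 2. Not valid UTF-8: fall back to the legacy encoding.
    encoding = legacy_encoding or locale.getpreferredencoding(False)
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # 3. Last resort: iso-8859-1 maps every byte to a character,
        #    so this step cannot fail (the result may still be wrong).
        return raw.decode("iso-8859-1")
```

For example, decode_zip_name(name_bytes, "gbk") would be the call made after a user picks GBK in the dialog; calling it again with another encoding re-decodes the same bytes, which is what lets the list be refreshed on the fly.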
