On Mon, Nov 12, 2012 at 5:29 PM, Simos Xenitellis <[email protected]> wrote:
> Also see
> https://bugzilla.gnome.org/show_bug.cgi?id=306403
> https://bugzilla.redhat.com/show_bug.cgi?id=225576

I will check later.

> I think that the wider issue is about how to deal with legacy
> (=non-UTF8) encodings.
> Not only with filenames from within ZIP archives, but also text files
> in legacy encoding (such as subtitles),
> IDv3 tags and so on.

For ZIP, the best tool I know so far is lsar/unar from The Unarchiver
project.

http://manpages.ubuntu.com/manpages/precise/en/man1/lsar.1.html
http://manpages.ubuntu.com/manpages/precise/en/man1/unar.1.html

As you can see from their man pages, they natively support automatic
encoding detection and manual encoding override.

However, I don't think we can get rid of the Info-ZIP stack. The most
annoying thing about Info-ZIP's UnZip is that for a non-UTF8 archive you
can only get '?' for non-ASCII characters; there is no way to get the
raw file name bytes. The patched UnZip that adds -I and -O just gives
more '?'. I hope Info-ZIP's next release resolves these issues.

For plain text and gedit, see my post on gedit-list:
https://mail.gnome.org/archives/gedit-list/2012-November/msg00008.html

For ID3, can you show me a legitimate way to buy MP3s that contain
problematic ID3 tags? I bought some songs from the Ubuntu Music Store,
but they contain English metadata only. If problematic ID3 tags only
come from other sources, I think users should convert the ID3 encoding
themselves; there are tools out there.

> There have been some proposals to guess the legacy encoding (using
> frequencies of letters, etc), however they add to the complexity.

Most people port Mozilla's detector. There is no GNOME port of that
library yet, but there is a KDE one; try Kate on a plain text file in a
local legacy encoding for inspiration. I already mentioned a similar
idea on gedit-list:
https://mail.gnome.org/archives/gedit-list/2012-October/msg00001.html

> AFAIK, if gtk/glib finds an invalid UTF-8 encoding in text, it tries
> to convert from iso-8859-1 to UTF-8.
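About the raw file name bytes mentioned above: when a tool decodes a legacy name with a single-byte codec that maps all 256 byte values (cp437 is one such codec), the mangling is lossless and can be reversed. A small illustrative sketch in Python (the GBK sample name is my own, not from any real archive):

```python
# Illustrative: recovering raw filename bytes from a name that some tool
# has (wrongly) decoded as cp437. Every byte value maps in cp437, so the
# decode never fails and can be reversed to recover the original bytes.
raw = "中文.txt".encode("gbk")        # bytes as stored in a legacy archive
mojibake = raw.decode("cp437")        # what a naive cp437 decode displays
recovered = mojibake.encode("cp437")  # the raw bytes, recovered losslessly
print(recovered.decode("gbk"))        # -> 中文.txt
```

This is exactly what UnZip's '?' substitution destroys: once a byte is replaced by '?', no re-decoding can bring it back.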
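As a toy illustration of the guessing idea (nowhere near the statistical, letter-frequency approach of Mozilla's detector; the function name and candidate list here are made up):

```python
def guess_decode(raw, candidates=("utf-8", "gbk", "cp1251")):
    """Try candidate encodings in order; a crude stand-in for a real
    statistical detector such as Mozilla's universalchardet."""
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            pass
    # iso-8859-1 maps every byte, so it always "succeeds" as a last resort
    return raw.decode("iso-8859-1"), "iso-8859-1"

print(guess_decode("中文".encode("gbk")))  # -> ('中文', 'gbk')
```

The weakness is obvious: most single-byte encodings decode almost any byte sequence without error, so ordering alone cannot distinguish them. That is why real detectors score letter frequencies instead of just trying codecs.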
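The general shape of such a fallback is easy to sketch. In Python, for illustration only (GTK_LEGACY_ENCODING is the hypothetical hint variable from Simos' proposal, not an existing GTK/GLib knob):

```python
import locale
import os

def to_utf8(raw: bytes) -> str:
    """Accept valid UTF-8 as-is; otherwise reinterpret the bytes using a
    legacy-encoding hint from the environment (hypothetical variable),
    falling back to the locale's preferred encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        legacy = (os.environ.get("GTK_LEGACY_ENCODING")
                  or locale.getpreferredencoding(False))
        return raw.decode(legacy, errors="replace")

os.environ["GTK_LEGACY_ENCODING"] = "gbk"
print(to_utf8("中文".encode("gbk")))  # -> 中文
```

Note this is per-process and global; the hard part in a real toolkit is that different inputs (one archive vs. another) may need different legacy encodings.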
> What I believe should happen is for gtk/glib to get a hint from the
> operating system locale (i.e. a variable GTK_LEGACY_ENCODING), and
> autoconvert any invalid text from GTK_LEGACY_ENCODING to UTF-8.

I don't think the fallback is currently done at the GTK/GLib level;
please correct me if I'm wrong.

> For your case with ZIP archives, you deal with archives that may have
> been created with a localised version of Windows, thus the filenames
> may have a legacy encoding.

Well, decent ZIP software on Windows, e.g. 7-Zip, does create
Unicode-enabled ZIP archives now. Microsoft's built-in ZIP support is
another story.

> Thus, my easy recommendation:
>
> File-roller considers all ZIP files to contain UTF-8 encoded
> filenames. When it detects that the encoding is not UTF-8, then it
> tries to convert from a legacy encoding to UTF-8. File-roller can
> guess based on the system locale, or it can show to the user a dialog
> box with the best guess, and allow to change encoding on the fly until
> the filenames in the textbox make sense.

File Roller is not that smart, I guess: it accepts whatever Info-ZIP or
p7zip returns. That's why, after I hacked the Info-ZIP interfacing code,
I realized that Info-ZIP itself needs some hacking as well. p7zip can
return file names from a Unicode-enabled ZIP archive correctly, and
garbage otherwise; but p7zip doesn't support encoding conversion. I
thought about hacking p7zip, but I really don't like its Windows-flavoured
code base and convoluted build system.

_______________________________________________
desktop-devel-list mailing list
[email protected]
https://mail.gnome.org/mailman/listinfo/desktop-devel-list
