[jira] Commented: (SANDBOX-176) Enable creation of tool-readable ZIP archives with file names containing non-ASCII characters

Stefan Bodewig (JIRA) Thu, 12 Feb 2009 03:00:23 -0800

    [ https://issues.apache.org/jira/browse/SANDBOX-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672947#action_12672947 ]


Stefan Bodewig commented on SANDBOX-176:
----------------------------------------

My, this is getting complex, in particular since we are having parallel 
discussions in JIRA and the mailing list - I'd prefer to stick to one and use 
the dev list.

Since this issue is currently assigned to me, I should point out that even 
though there is a lot of activity around it just now, it may take some time 
until I get around to it (time constraints due to personal issues).

IMHO we have a bunch of separate issues and I'll try to address them as 
briefly as possible here.

* Encoding within the constraints of "older" zip clients - i.e. any InfoZIP 
based client < 3.x for example.

  This seems to be taken care of in OutputStream and ZipFile via setEncoding 
and it is known to work since this is what Ant has been using for years now.

  There currently is some discussion whether the default should be UTF-8 
(compatible with java.util.zip) or the platform's default encoding (compatible 
with Ant's code base).

* More modern ways to specify the encoding.

  All of these can only be optional IMHO.  I'm very grateful for the patches 
and the tests you've performed and I promise to review them thoroughly, I may 
just be slower than you'd hope.

* ZipInputStream

  I haven't looked at any code yet (GPLed code is a no-go, I'd rather see how 
the nice folks at Harmony have implemented it), but given that java.util.zip 
only uses the LFH (local file header) part of the extra fields I suspect the 
code is cheating.

  My guess is that ZipInputStream only reads the LFH parts and completely 
ignores the central directory - we could do that as well.  If I'm correct then 
the implementation may return wrong results: it is perfectly valid for an 
archive to contain "dead" local file data created by tools that updated an 
existing archive (they just append a new local file data block and rewrite the 
central directory to point at the new location, leaving the old file data in 
place).

* "build ZIP for Java consumers":

   Use JarArchiveOutputStream, which explicitly sets the encoding to UTF-8, 
and it works.  This is how Ant writes jar files.

* JDK version

  It has been decided to require JDK 1.4.  Up until now there is no pressing 
reason to drop 1.3 support (there is a LinkedHashMap somewhere, but that can be 
worked around), but this should be discussed with the community.
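
To illustrate the first bullet, here is a minimal sketch of writing and 
reading entry names with an explicitly chosen encoding.  It is only a 
stand-in: it assumes a Java 7 or later runtime and uses the Charset-accepting 
constructors of java.util.zip instead of the setEncoding method mentioned 
above, so that it runs without extra libraries.  The class name 
ZipNameEncodingDemo, its helper methods and the entry name are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipNameEncodingDemo {

    /** Writes a one-entry archive whose entry name is encoded with nameCharset. */
    public static byte[] writeZip(String entryName, Charset nameCharset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer, nameCharset)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write("hello".getBytes(StandardCharsets.US_ASCII));
            zip.closeEntry();
        }
        return buffer.toByteArray();
    }

    /** Reads the name of the first entry, decoding it with nameCharset. */
    public static String readFirstName(byte[] archive, Charset nameCharset) throws IOException {
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(archive), nameCharset)) {
            ZipEntry entry = zip.getNextEntry();
            return entry == null ? null : entry.getName();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the runtime ships the IBM437 (Cp437) charset.
        Charset cp437 = Charset.forName("IBM437");
        String name = "\u00e4rger.txt"; // hypothetical non-ASCII entry name
        byte[] archive = writeZip(name, cp437);
        // The name survives when writer and reader agree on the encoding.
        System.out.println(name.equals(readFirstName(archive, cp437)));
    }
}
```

The point of the sketch is only that writer and reader have to agree on the 
encoding; which default to pick is the open question noted above.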

> Enable creation of tool-readable ZIP archives with file names containing 
> non-ASCII characters
> ---------------------------------------------------------------------------------------------
>
>                 Key: SANDBOX-176
>                 URL: https://issues.apache.org/jira/browse/SANDBOX-176
>             Project: Commons Sandbox
>          Issue Type: Improvement
>          Components: Compress
>         Environment: Any / All
>            Reporter: Christian Gosch
>            Assignee: Stefan Bodewig
>         Attachments: commons-compress-utf8-creation-svn741897.patch, 
> utf8-7zip-test.zip, utf8-winzip-test.zip
>
>
> Currently it is not possible to generate externally readable ZIP archives 
> with java.util.zip.* or org.apache.commons.compress.* when entries to include 
> shall have names with characters outside US-ASCII. This should be changed to 
> enable at least org.apache.commons.compress.* to produce ZIP archives in 
> international context which are readable by usual ZIP archiver tools like 
> pkzip, gzip, WinZIP, PowerArchiver, WinRAR / rar, StuffIt...
> For java.util.zip.* this is due to a really old flaw on handling entry names: 
> They are just always rendered as UTF-8, which is kind of Java specific, and 
> not as Cp437, which is expected and written by most ZIP archiver tools (or 
> eventually all). For more details see:
> http://bugs.sun.com/bugdatabase/view_bug.do;:YfiG?bug_id=4244499
> http://bugs.sun.com/bugdatabase/view_bug.do;:YfiG?bug_id=4820807
> For org.apache.commons.compress.archivers.zip.* the "compress & save" 
> operation can be easily improved by extending ZipArchive:
> // Add member:
>     protected String m_encoding = null;
> // Add constructor:
>     public ZipArchive(String encoding) {
>         m_encoding = encoding;
>     }
> // Extend doSave(FileOutputStream):
> // ...
>         // Pack-Operation
>         ZipOutputStream out = null;
>         try {
>             out = new ZipOutputStream(new BufferedOutputStream(output));
>             if (m_encoding != null) {   // added
>                 out.setEncoding(m_encoding);   // added
>             }  // added
>             while (iterator.hasNext()) {
> // ...
> Now it is possible to instantiate a ZipArchive with "Cp437" as encoding, and 
> external tools can figure out the original entry names even if they contain 
> non-ASCII characters. (On the other hand, Java cannot read back & extract 
> such an archive since it expects UTF-8!)
> The "read & extract" operation for ZipArchive is more difficult to extend 
> since it currently relies completely on java.util.zip.* . The other reason 
> is that ZIP archives do not contain any hint on the character encoding used 
> for file names etc. It seems that the usual tools simply use Cp437 and Java 
> simply uses UTF-8 -- without any declaration of reasons. Thus an extracting 
> tool has to guess.
> For TarArchive the problem is unclear. Here the commons-compress 
> implementation does not rely on third-party code as far as I can see, and TAR 
> is no Java-bound file type (like JAR, which is Java-bound). Thus chances are, 
> that everything works well, even when entry names with non-ASCII characters 
> come into play.
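
The mismatch described in the quoted issue can be seen at the byte level 
without any ZIP code at all.  The following is a small sketch, assuming the 
runtime provides the IBM437 (Cp437) charset; the class EntryNameBytes, the 
helper reinterpret and the sample name are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EntryNameBytes {

    /** Encodes a name with the writer's charset and decodes the bytes with the reader's. */
    public static String reinterpret(String name, Charset writer, Charset reader) {
        return new String(name.getBytes(writer), reader);
    }

    public static void main(String[] args) {
        // Assumption: IBM437 is available in this runtime.
        Charset cp437 = Charset.forName("IBM437");
        String name = "\u00e4.txt"; // "a-umlaut" followed by ".txt"

        // The same name occupies a different number of bytes in each encoding.
        System.out.println(name.getBytes(StandardCharsets.UTF_8).length); // 6
        System.out.println(name.getBytes(cp437).length);                  // 5

        // A Cp437 reader does not recover a name that was written as UTF-8.
        System.out.println(name.equals(reinterpret(name, StandardCharsets.UTF_8, cp437))); // false
    }
}
```

A tool that assumes Cp437 therefore shows two unrelated characters where the 
UTF-8 writer stored a single non-ASCII one.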

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
