[jira] Commented: (SANDBOX-176) Enable creation of tool-readable ZIP archives with file names containing non-ASCII characters

Wolfgang Glas (JIRA) Wed, 11 Feb 2009 02:37:27 -0800

    [ 
https://issues.apache.org/jira/browse/SANDBOX-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672580#action_12672580
 ]


Wolfgang Glas commented on SANDBOX-176:
---------------------------------------

Hi Stefan,

  Right now, I've got some more time to study your comments and to review the 
ZIP appnote as well as java.util.zip.ZipInputStream:

1) I did not find any hint in the appnote that indicates which "version needed 
to extract" should be set when the EFS flag is set. I've already experimented 
with setting "version needed to extract" to 6.3 when the EFS flag is set. 
However, all ZIP programs seem to ignore the "version needed to extract".

2) Nevertheless, using the EFS flag seems to be discouraged, because there are 
too many programs out there which do not cope with EFS/utf-8. At least WinZip 
seems to stick to this policy, as it writes unicode extra fields for entries 
not encodable by CP437.

3) If I understand you right, we should keep the 'setEncoding(String)' 
interface, as too many Ant API users are used to it.

4) I think we should introduce a parameter 'setFallbackToEFS(boolean)', which 
uses utf-8 and the EFS flag for entries not encodable by the encoding set 
through setEncoding(String).

5) Additionally, we may add a parameter 'setFallbackToUnicodeExtras(boolean)', 
which triggers the creation of UnicodePath/UnicodeComment extra fields for 
names not encodable by the encoding set through setEncoding(String).

6) We might consider adding a method tuneForUnicodeCompatibility(), which 
arranges the default settings in a way that is compatible with most 
decompressors, to the best of the implementors' knowledge. This method may 
select different parameter settings as decoders adopt new standards in the 
future.

7) IMHO there are many situations where someone might decode a ZIP stream 
instead of a ZIP file: resources which are read from a jar file rather than a 
classes folder, servlet input streams, etc. Therefore I'd like to see a 
unicode-enabled version of ZipArchiveInputStream in commons-compress. The 
ZipInputStream code in java.util.zip of openjdk-6 is about 600 LoC (including 
the base class InflaterInputStream), so I think it should be possible to 
provide a reimplementation.

8) Do you know whether it is possible to take openjdk-6 code and to import it 
into commons-compress? Are there license issues with such an import?

9) How about JDK/JRE compliance? My implementation of ZipEncodingHelper uses 
java.nio.charset.Charset, which is part of jre-1.4. Does commons-compress still 
target jre-1.3 or is it OK to use 1.4 APIs?
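
To make items 4) and 5) more concrete, here is a minimal sketch of how the 
fallback decision and the unicode path extra field could look. It is only an 
illustration under stated assumptions, not a proposal for the final API: the 
class and method names (ZipNameFallbackSketch, generalPurposeFlags, 
unicodePathExtra) are hypothetical and do not exist in commons-compress. The 
sketch assumes that bit 11 of the general purpose flags is the EFS flag and 
that the extra field follows the Info-ZIP layout (header ID 0x7075, version 1, 
CRC-32 of the name bytes stored in the header, then the UTF-8 name). It also 
uses java.nio.charset.StandardCharsets for brevity, which needs a newer JRE 
than 1.4.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical helper for illustration only; not part of commons-compress. */
final class ZipNameFallbackSketch {

    /** General purpose bit 11: language encoding flag (EFS), name is UTF-8. */
    static final int EFS_FLAG = 1 << 11;

    /** Header ID of the Info-ZIP Unicode Path extra field. */
    static final int UNICODE_PATH_ID = 0x7075;

    /** True if every character of the name can be encoded in the charset. */
    static boolean canEncode(String name, Charset charset) {
        return charset.newEncoder().canEncode(name);
    }

    /** Sets EFS only if the configured encoding fails and the fallback is on. */
    static int generalPurposeFlags(String name, Charset configured,
                                   boolean fallbackToEFS) {
        return (!canEncode(name, configured) && fallbackToEFS) ? EFS_FLAG : 0;
    }

    /**
     * Builds a Unicode Path extra field for the given name.
     * rawHeaderName holds the name bytes as written to the entry header.
     */
    static byte[] unicodePathExtra(String name, byte[] rawHeaderName) {
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
        CRC32 crc = new CRC32();
        crc.update(rawHeaderName);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLE(out, UNICODE_PATH_ID, 2);     // header ID
        writeLE(out, 1 + 4 + utf8.length, 2); // size of the data that follows
        out.write(1);                         // version of this extra field
        writeLE(out, crc.getValue(), 4);      // CRC-32 of the header name
        out.write(utf8, 0, utf8.length);      // UTF-8 encoded name
        return out.toByteArray();
    }

    /** Writes the lowest 'bytes' bytes of value in little-endian order. */
    private static void writeLE(ByteArrayOutputStream out, long value, int bytes) {
        for (int i = 0; i < bytes; i++) {
            out.write((int) ((value >>> (8 * i)) & 0xff));
        }
    }

    public static void main(String[] args) {
        Charset configured = StandardCharsets.ISO_8859_1;
        String name = "\u65e5\u672c.txt"; // not encodable in ISO-8859-1
        int flags = generalPurposeFlags(name, configured, true);
        // getBytes() substitutes the charset's replacement byte for unmappable chars
        byte[] extra = unicodePathExtra(name, name.getBytes(configured));
        System.out.println("flags=0x" + Integer.toHexString(flags)
                + ", extra field length=" + extra.length);
    }
}
```

With setFallbackToEFS(true) a writer would use the first path (UTF-8 name plus 
the flag); with setFallbackToUnicodeExtras(true) it would keep the name in the 
configured encoding and attach the extra field instead.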

  Best regards,

   Wolfgang


> Enable creation of tool-readable ZIP archives with file names containing 
> non-ASCII characters
> ---------------------------------------------------------------------------------------------
>
>                 Key: SANDBOX-176
>                 URL: https://issues.apache.org/jira/browse/SANDBOX-176
>             Project: Commons Sandbox
>          Issue Type: Improvement
>          Components: Compress
>         Environment: Any / All
>            Reporter: Christian Gosch
>            Assignee: Stefan Bodewig
>         Attachments: commons-compress-utf8-creation-svn741897.patch, 
> utf8-7zip-test.zip, utf8-winzip-test.zip
>
>
> Currently it is not possible to generate externally readable ZIP archives 
> with java.util.zip.* or org.apache.commons.compress.* when entries to include 
> shall have names with characters outside US-ASCII. This should be changed to 
> enable at least org.apache.commons.compress.* to produce ZIP archives in 
> international context which are readable by usual ZIP archiver tools like 
> pkzip, gzip, WinZIP, PowerArchiver, WinRAR / rar, StuffIt...
> For java.util.zip.* this is due to a really old flaw in handling entry names: 
> They are just always written as UTF-8, which is kind of Java specific, and 
> not as Cp437, which is expected and written by most ZIP archiver tools (or 
> possibly all). For more details see:
> http://bugs.sun.com/bugdatabase/view_bug.do;:YfiG?bug_id=4244499
> http://bugs.sun.com/bugdatabase/view_bug.do;:YfiG?bug_id=4820807
> For org.apache.commons.compress.archivers.zip.* the "compress & save" 
> operation can be easily improved by extending ZipArchive:
> // Add member:
>     protected String m_encoding = null;
> // Add constructor:
>     public ZipArchive(String encoding) {
>         m_encoding = encoding;
>     }
> // Extend doSave(FileOutputStream):
> // ...
>             // Pack operation
>             ZipOutputStream out = null;
>             try {
>                 out = new ZipOutputStream(new BufferedOutputStream(output));
>                 if (m_encoding != null) {   // added
>                     out.setEncoding(m_encoding);   // added
>                 }  // added
>                 while (iterator.hasNext()) {
> // ...
> Now it is possible to instantiate a ZipArchive with "Cp437" as encoding, and 
> external tools can figure out the original entry names even if they contain 
> non-ASCII characters. (On the other hand, Java cannot read back & extract 
> such an archive since it expects UTF-8!)
> The "read & extract" operation for ZipArchive is more difficult to extend 
> since it currently relies completely on java.util.zip.* . The other reason 
> is that ZIP archives do not contain any hint on the character encoding used 
> for file names etc. It seems that the usual tools simply use Cp437 and Java 
> simply uses UTF-8 -- without any declaration of reasons. Thus an extractor 
> has to guess.
> For TarArchive the problem is unclear. Here the commons-compress 
> implementation does not rely on third-party code as far as I can see, and TAR 
> is no Java-bound file type (like JAR, which is Java-bound). Thus chances are, 
> that everything works well, even when entry names with non-ASCII characters 
> come into play.
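
The mismatch described in the quoted issue text can be reproduced in a few 
lines. This is a sketch for illustration only: the class name 
NameEncodingMismatchSketch and the method asSeenBy are hypothetical, and the 
charset name "IBM437" (Cp437) is assumed to be available only on JREs that 
ship the extended charsets, so the example falls back to ISO-8859-1 otherwise.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class; shows what a non-UTF-8 reader makes of a UTF-8 name. */
final class NameEncodingMismatchSketch {

    /**
     * Encodes the name as UTF-8 (as java.util.zip is described to do above)
     * and decodes the resulting bytes with the charset the reading tool assumes.
     */
    static String asSeenBy(String name, Charset readerCharset) {
        byte[] written = name.getBytes(StandardCharsets.UTF_8);
        return new String(written, readerCharset);
    }

    public static void main(String[] args) {
        String name = "Gr\u00fc\u00dfe.txt";
        // Cp437 is not one of the guaranteed standard charsets, so check first.
        String reader = Charset.isSupported("IBM437") ? "IBM437" : "ISO-8859-1";
        System.out.println(reader + " reader sees: "
                + asSeenBy(name, Charset.forName(reader)));
    }
}
```

Pure ASCII names survive the round trip unchanged, which is why the problem 
only shows up once entry names leave US-ASCII.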
