[
https://issues.apache.org/jira/browse/SANDBOX-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672937#action_12672937
]
Christian Gosch commented on SANDBOX-176:
-----------------------------------------
As the reporter, I am of course glad to see action on this topic :-)
I have been following the discussion and want to point out that any change
should keep the purpose and use cases of this library in mind. As far as I can
see, there are two: use inside Java, where proper UTF-8 file name handling is
expected, and exchange with the widespread compression utilities on different
OSs and in different languages. It therefore seems pointless to me to insist
on implementing declared standards when most of the tools which "consume"
archives do not recognize them anyway.
Following from this, it may be useful to offer convenience methods such as
"build ZIP for Java consumers" and "build ZIP for usual tools" on the one
hand, and on the other hand methods that let developers who need it choose the
specific settings explicitly, that is: setting a flag for any of the mentioned
techniques, and setting any encoding. I do not know which encoding WinZip
expects for Traditional Chinese...
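To make the idea of two convenience methods plus one explicit method more
concrete, here is a minimal sketch. It is not the commons-compress API: the
class ZipWriters and its three method names are hypothetical, and it uses
java.util.zip.ZipOutputStream with a Charset argument, which assumes a JDK 7
or later runtime (well beyond the 1.3 / 1.4 targets discussed below). It also
assumes the runtime ships the IBM437 charset, the JDK's name for Cp437.

```java
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, for illustration only.
final class ZipWriters {
    private ZipWriters() {}

    // "Build ZIP for Java consumers": entry names written as UTF-8.
    static ZipOutputStream forJavaConsumers(OutputStream out) {
        return withEncoding(out, StandardCharsets.UTF_8);
    }

    // "Build ZIP for usual tools": entry names written as Cp437.
    // Assumption: the runtime provides IBM437; otherwise
    // Charset.forName throws UnsupportedCharsetException.
    static ZipOutputStream forUsualTools(OutputStream out) {
        return withEncoding(out, Charset.forName("IBM437"));
    }

    // Explicit variant for developers who need a specific encoding.
    static ZipOutputStream withEncoding(OutputStream out, Charset encoding) {
        return new ZipOutputStream(out, encoding);
    }
}
```

The two named methods only pick a default; everything else goes through
withEncoding, so a caller with special requirements is not locked out.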
Regarding the JDK, I would prefer to stay on 1.3 unless there is an actual
need for features from newer versions, but AFAIK the lib is currently set up
for 1.4 anyway (which was a problem for me, as a customer kept me stuck on IBM
WAS50 until Jan. 2008). To give an example: if file name parsing simply has to
search for the first / last "." to find the extension, I cannot see any need
to do this with a regexp.
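As an illustration of the last point, a sketch of extension lookup using only
String.lastIndexOf and String.substring, both of which are available on JDK
1.3. The class and method names are made up for this example.

```java
// Hypothetical helper, for illustration only.
final class FileNames {
    private FileNames() {}

    // Returns the text after the last '.', or "" if the name has no dot
    // or ends with a dot.
    static String extension(String name) {
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            return "";
        }
        return name.substring(dot + 1);
    }
}
```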
> Enable creation of tool-readable ZIP archives with file names containing
> non-ASCII characters
> ---------------------------------------------------------------------------------------------
>
> Key: SANDBOX-176
> URL: https://issues.apache.org/jira/browse/SANDBOX-176
> Project: Commons Sandbox
> Issue Type: Improvement
> Components: Compress
> Environment: Any / All
> Reporter: Christian Gosch
> Assignee: Stefan Bodewig
> Attachments: commons-compress-utf8-creation-svn741897.patch,
> utf8-7zip-test.zip, utf8-winzip-test.zip
>
>
> Currently it is not possible to generate externally readable ZIP archives
> with java.util.zip.* or org.apache.commons.compress.* when entries to include
> shall have names with characters outside US-ASCII. This should be changed to
> enable at least org.apache.commons.compress.* to produce ZIP archives in
> international context which are readable by usual ZIP archiver tools like
> pkzip, gzip, WinZIP, PowerArchiver, WinRAR / rar, StuffIt...
> For java.util.zip.* this is due to a really old flaw in the handling of entry
> names: they are always written as UTF-8, which is kind of Java specific, and
> not as Cp437, which is expected and written by most ZIP archiver tools (or
> possibly all). For more details see:
> http://bugs.sun.com/bugdatabase/view_bug.do;:YfiG?bug_id=4244499
> http://bugs.sun.com/bugdatabase/view_bug.do;:YfiG?bug_id=4820807
> For org.apache.commons.compress.archivers.zip.* the "compress & save"
> operation can be easily improved by extending ZipArchive:
> // Add member:
> protected String m_encoding = null;
> // Add constructor:
> public ZipArchive(String encoding) {
>     m_encoding = encoding;
> }
> // Extend doSave(FileOutputStream):
> // ...
> // Pack-Operation
> ZipOutputStream out = null;
> try {
>     out = new ZipOutputStream(new BufferedOutputStream(output));
>     if (m_encoding != null) {        // added
>         out.setEncoding(m_encoding); // added
>     }                                // added
>     while (iterator.hasNext()) {
>         // ...
> Now it is possible to instantiate a ZipArchive with "Cp437" as encoding, and
> external tools can figure out the original entry names even if they contain
> non-ASCII characters. (On the other hand, Java cannot read back & inflate
> such an archive, since it expects UTF-8!)
> The "read & inflate" operation for ZipArchive is more difficult to extend,
> since it currently relies completely on java.util.zip.*. The other reason
> is that ZIP archives do not contain any hint about the character encoding
> used for file names etc. It seems that the usual tools simply use Cp437 and
> Java simply uses UTF-8 -- without any declaration of reasons. Thus a reader
> has to try.
> For TarArchive the problem is unclear. Here the commons-compress
> implementation does not rely on third-party code as far as I can see, and TAR
> is no Java-bound file type (like JAR, which is Java-bound). Thus chances are,
> that everything works well, even when entry names with non-ASCII characters
> come into play.
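On the remark in the description that the reading side "has to try" an
encoding: one possible heuristic, sketched below under assumptions, is to
decode the raw entry name strictly as UTF-8 and fall back to a second charset
only if the bytes are not valid UTF-8. The class EntryNames and its method are
hypothetical, the java.nio.charset API used here needs JDK 1.4 (with
StandardCharsets from JDK 7), and the heuristic can guess wrong because some
Cp437 byte sequences are also valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class EntryNames {
    private EntryNames() {}

    // Tries strict UTF-8 first; if the bytes are malformed as UTF-8,
    // decodes them with the given fallback charset instead. For ZIP
    // files from usual tools the fallback would be Cp437, for example
    // Charset.forName("IBM437") where the runtime provides it.
    static String decode(byte[] raw, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(raw, fallback);
        }
    }
}
```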