[
https://issues.apache.org/jira/browse/COMPRESS-212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543410#comment-13543410
]
Woo Ju Shin commented on COMPRESS-212:
--------------------------------------
I have tried a workaround for this.
The following code is getNextTarEntry() from TarArchiveInputStream.java.
/**
* Get the next entry in this tar archive. This will skip
* over any remaining data in the current entry, if there
* is one, and place the input stream at the header of the
* next entry, and read the header and instantiate a new
* TarEntry from the header bytes and return that entry.
* If there are no more entries in the archive, null will
* be returned to indicate that the end of the archive has
* been reached.
*
* @return The next TarEntry in the archive, or null.
* @throws IOException on error
*/
public TarArchiveEntry getNextTarEntry() throws IOException {
if (hasHitEOF) {
return null;
}
if (currEntry != null) {
long numToSkip = entrySize - entryOffset;
while (numToSkip > 0) {
long skipped = skip(numToSkip);
if (skipped <= 0) {
throw new RuntimeException("failed to skip current tar entry");
}
numToSkip -= skipped;
}
readBuf = null;
}
byte[] headerBuf = getRecord();
if (hasHitEOF) {
currEntry = null;
return null;
}
try {
currEntry = new TarArchiveEntry(headerBuf, encoding);
} catch (IllegalArgumentException e) {
IOException ioe = new IOException("Error detected parsing the header");
ioe.initCause(e);
throw ioe;
}
entryOffset = 0;
entrySize = currEntry.getSize();
if (currEntry.isGNULongNameEntry()) {
// read in the name
StringBuffer longName = new StringBuffer();
byte[] buf = new byte[SMALL_BUFFER_SIZE];
int length = 0;
while ((length = read(buf)) >= 0) {
longName.append(new String(buf, 0, length)); // TODO default charset?
}
getNextEntry();
if (currEntry == null) {
// Bugzilla: 40334
// Malformed tar file - long entry name not followed by entry
return null;
}
// remove trailing null terminator
if (longName.length() > 0
&& longName.charAt(longName.length() - 1) == 0) {
longName.deleteCharAt(longName.length() - 1);
}
currEntry.setName(longName.toString());
}
if (currEntry.isPaxHeader()){ // Process Pax headers
paxHeaders();
}
if (currEntry.isGNUSparse()){ // Process sparse files
readGNUSparse();
}
// If the size of the next element in the archive has changed
// due to a new size being reported in the posix header
// information, we update entrySize here so that it contains
// the correct value.
entrySize = currEntry.getSize();
return currEntry;
}
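Note the comment '// TODO default charset?' in the long-name handling.
That call, new String(buf, 0, length), takes no charset argument, so it decodes
with the platform default charset and ignores the encoding passed to the
TarArchiveInputStream constructor.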
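To illustrate why that matters, here is a minimal sketch that decodes the same
name bytes with two different charsets. The class and method names are
hypothetical and not part of Commons Compress. ISO-8859-1 stands in for "some
default charset that is not UTF-8" (on my Windows machine it would be CP949),
because it is guaranteed to be available on every JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetDemo {
    // Decodes raw name bytes with an explicitly chosen charset.
    // (Hypothetical helper, for illustration only.)
    static String decode(byte[] raw, Charset cs) {
        return new String(raw, 0, raw.length, cs);
    }

    public static void main(String[] args) {
        // A Korean file name, stored as UTF-8 bytes the way the Linux side would write it.
        String original = "\uD55C\uAE00.txt";
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the archive was written in restores the name.
        String right = decode(raw, StandardCharsets.UTF_8);
        // Decoding with a different charset (stand-in for the platform default) does not.
        String wrong = decode(raw, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(original)); // prints true
        System.out.println(wrong.equals(original)); // prints false
    }
}
```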
My workaround keeps the encoding that was originally passed to
TarArchiveInputStream(), collects the complete file name into a single byte[],
and only then decodes that byte[] into a String using the encoding, so that the
result can be set on the entry with entry.setName().
This workaround works well so far, but it obviously needs more testing.
I'll keep testing it until next week.
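A minimal sketch of the idea, not the actual patch: the class name, method name
and buffer size below are my own placeholders, and the stream passed in stands
for the data of the GNU long-name entry. Collecting all bytes first also avoids
splitting a multi-byte character across two read() calls, which can happen when
each chunk is decoded on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LongNameDecoder {
    // Reads the whole long-name entry as bytes, then decodes once with the
    // given encoding. A null encoding falls back to the platform default.
    // (Hypothetical helper, for illustration only.)
    static String readLongName(InputStream in, String encoding) throws IOException {
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        byte[] buf = new byte[256]; // assumed buffer size
        int length;
        while ((length = in.read(buf)) >= 0) {
            collected.write(buf, 0, length);
        }
        byte[] raw = collected.toByteArray();

        // Remove the trailing NUL terminator before decoding.
        int end = raw.length;
        if (end > 0 && raw[end - 1] == 0) {
            end--;
        }
        Charset cs = (encoding == null)
                ? Charset.defaultCharset()
                : Charset.forName(encoding);
        return new String(raw, 0, end, cs);
    }

    public static void main(String[] args) throws IOException {
        // Simulated long-name entry data: UTF-8 name followed by a NUL byte.
        byte[] data = "\uD55C\uAE00.txt\0".getBytes(StandardCharsets.UTF_8);
        String name = readLongName(new ByteArrayInputStream(data), "UTF-8");
        System.out.println(name.equals("\uD55C\uAE00.txt")); // prints true
    }
}
```

In getNextTarEntry() the returned String would then be passed to
currEntry.setName() in place of longName.toString().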
> TarArchiveEntry getName() returns wrongly encoded name even when you set
> encoding to TarArchiveInputStream
> ----------------------------------------------------------------------------------------------------------
>
> Key: COMPRESS-212
> URL: https://issues.apache.org/jira/browse/COMPRESS-212
> Project: Commons Compress
> Issue Type: Bug
> Affects Versions: 1.4.1
> Environment: Red Hat Enterprise Linux, MS Windows 7
> Reporter: Woo Ju Shin
> Priority: Minor
>
> I have two systems: one runs Red Hat Linux, the other MS Windows.
> I created a *.tgz file on Red Hat Linux and tried to decompress it on MS
> Windows using Commons Compress.
> The default system encodings are different: UTF-8 on Red Hat Linux and CP949
> on MS Windows.
> It seems that the file names are decoded with the default encoding even
> when I pass the encoding explicitly, using the following code to untar it.
> FileInputStream fis = new FileInputStream(new File(*.tgz));
> TarArchiveInputStream zis = new TarArchiveInputStream(new
> BufferedInputStream(fis),encodingOfRedHatLinux);
> while ((entry = (TarArchiveEntry)zis.getNextEntry()) != null)
> {
> entry.getName(); // filename is not UTF-8, it is encoded in CP949, and so
> // the filename isn't consistent
> }
> According to the Javadoc of this constructor,
> /**
> * Constructor for TarInputStream.
> * @param is the input stream to use
> * @param encoding name of the encoding to use for file names
> * @since Commons Compress 1.4
> */
> public TarArchiveInputStream(InputStream is, String encoding) {
> this(is, TarBuffer.DEFAULT_BLKSIZE, TarBuffer.DEFAULT_RCDSIZE,
> encoding);
> }
> the encoding should be used for file names.
> In practice, however, this does not seem to work.