Martin Petricek created TIKA-1717:
-------------------------------------
Summary: Tika throws exception on detecting content-type of a zip file
Key: TIKA-1717
URL: https://issues.apache.org/jira/browse/TIKA-1717
Project: Tika
Issue Type: Bug
Reporter: Martin Petricek
When detecting the content type of a zip file with Tika 1.10 like this:
{code}
byte[] content = ... // whole zip file.
String name = "TR_01.ZIP";
Tika tika = new Tika();
return tika.detect(content, name);
{code}
it throws an exception:
{code}
java.lang.ArrayIndexOutOfBoundsException: 13
    at org.apache.commons.compress.archivers.zip.X7875_NewUnix.parseFromLocalFileData(X7875_NewUnix.java:199)
    at org.apache.commons.compress.archivers.zip.X7875_NewUnix.parseFromCentralDirectoryData(X7875_NewUnix.java:220)
    at org.apache.commons.compress.archivers.zip.ExtraFieldUtils.parse(ExtraFieldUtils.java:174)
    at org.apache.commons.compress.archivers.zip.ZipArchiveEntry.setCentralDirectoryExtra(ZipArchiveEntry.java:476)
    at org.apache.commons.compress.archivers.zip.ZipFile.readCentralDirectoryEntry(ZipFile.java:575)
    at org.apache.commons.compress.archivers.zip.ZipFile.populateFromCentralDirectory(ZipFile.java:492)
    at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:216)
    at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:192)
    at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:153)
    at org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:141)
    at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:88)
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
    at org.apache.tika.Tika.detect(Tika.java:155)
    at org.apache.tika.Tika.detect(Tika.java:183)
    at org.apache.tika.Tika.detect(Tika.java:223)
{code}
The zip file contains two .jpg images and is not a "special" zip file (JAR,
OpenOffice, ...).
Unfortunately, the contents of the zip file are confidential, so I cannot
attach it to this ticket as is. I can, however, provide the parameters
passed to
org.apache.commons.compress.archivers.zip.X7875_NewUnix.parseFromLocalFileData(X7875_NewUnix.java:199):
{code}
data = {byte[13]@2103}
0 = 85
1 = 84
2 = 5
3 = 0
4 = 7
5 = -112
6 = -108
7 = 51
8 = 85
9 = 117
10 = 120
11 = 0
12 = 0
offset = 13
length = 0
{code}
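It seems the method tries to read more bytes than are actually available in
the buffer: offset already equals the array length (13) and length is 0, so
the first read at data[offset] is out of bounds.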
Note that 7zip and unzip both extract the file without any warning, so the
file does not appear to be corrupted.
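To make the dump above easier to read, here is a minimal sketch that walks the 13-byte buffer as a sequence of zip extra-field records (2-byte little-endian header ID, 2-byte little-endian data length, then the data). It uses only the JDK. The class ExtraFieldWalk and all of its methods are hypothetical names made up for illustration; they are not part of Tika or Commons Compress, and canReadByte is only one possible shape for a bounds check, not the library's actual fix.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not Tika or Commons Compress code.
class ExtraFieldWalk {

    /** One extra-field record: header ID plus where its data starts and how long it is. */
    static final class Field {
        final int headerId;
        final int dataOffset;
        final int dataLength;

        Field(int headerId, int dataOffset, int dataLength) {
            this.headerId = headerId;
            this.dataOffset = dataOffset;
            this.dataLength = dataLength;
        }
    }

    /** The byte values reported in this ticket. */
    static byte[] reportedBuffer() {
        return new byte[] {85, 84, 5, 0, 7, -112, -108, 51, 85, 117, 120, 0, 0};
    }

    /** Reads an unsigned 16-bit little-endian value at pos. */
    static int readUnsignedShortLE(byte[] buf, int pos) {
        return (buf[pos] & 0xFF) | ((buf[pos + 1] & 0xFF) << 8);
    }

    /** Splits the buffer into records; stops when fewer than 4 header bytes remain. */
    static List<Field> walk(byte[] data) {
        List<Field> fields = new ArrayList<>();
        int pos = 0;
        while (pos + 4 <= data.length) {
            int id = readUnsignedShortLE(data, pos);
            int len = readUnsignedShortLE(data, pos + 2);
            fields.add(new Field(id, pos + 4, len));
            pos += 4 + len;
        }
        return fields;
    }

    /** Assumed guard: true only if one byte can be read at offset within the declared length. */
    static boolean canReadByte(byte[] data, int offset, int length) {
        return length >= 1 && offset >= 0 && offset < data.length;
    }

    public static void main(String[] args) {
        byte[] data = reportedBuffer();
        for (Field f : walk(data)) {
            System.out.printf("id=0x%04X dataOffset=%d dataLength=%d canReadFirstByte=%b%n",
                    f.headerId, f.dataOffset, f.dataLength,
                    canReadByte(data, f.dataOffset, f.dataLength));
        }
    }
}
```

Run against the reported bytes, this finds two records: header 0x5455 (the "UT" extended timestamp field) with 5 data bytes starting at offset 4, and header 0x7875 (the "ux" field that X7875_NewUnix handles) with 0 data bytes starting at offset 13. That matches the offset = 13, length = 0 arguments shown above, so the parser appears to be handed an empty "ux" field and reads its first byte without checking the length.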