Martin Petricek created TIKA-1717:
-------------------------------------

             Summary: Tika throws exception on detecting content-type of a zip file
                 Key: TIKA-1717
                 URL: https://issues.apache.org/jira/browse/TIKA-1717
             Project: Tika
          Issue Type: Bug
            Reporter: Martin Petricek


When trying to detect the content type of a zip file with Tika 1.10 like this:

{code}
        byte[] content = ... // whole zip file.
        String name = "TR_01.ZIP";
        Tika tika = new Tika();
        return tika.detect(content, name);
{code}

it throws an exception:

{code}
java.lang.ArrayIndexOutOfBoundsException: 13
        at org.apache.commons.compress.archivers.zip.X7875_NewUnix.parseFromLocalFileData(X7875_NewUnix.java:199)
        at org.apache.commons.compress.archivers.zip.X7875_NewUnix.parseFromCentralDirectoryData(X7875_NewUnix.java:220)
        at org.apache.commons.compress.archivers.zip.ExtraFieldUtils.parse(ExtraFieldUtils.java:174)
        at org.apache.commons.compress.archivers.zip.ZipArchiveEntry.setCentralDirectoryExtra(ZipArchiveEntry.java:476)
        at org.apache.commons.compress.archivers.zip.ZipFile.readCentralDirectoryEntry(ZipFile.java:575)
        at org.apache.commons.compress.archivers.zip.ZipFile.populateFromCentralDirectory(ZipFile.java:492)
        at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:216)
        at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:192)
        at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:153)
        at org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:141)
        at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:88)
        at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
        at org.apache.tika.Tika.detect(Tika.java:155)
        at org.apache.tika.Tika.detect(Tika.java:183)
        at org.apache.tika.Tika.detect(Tika.java:223)
{code}

The zip file contains two .jpg images and is not a "special" (JAR, 
OpenOffice, ...) zip file.

Unfortunately, the contents of the zip file are confidential, so I cannot 
attach it to this ticket as is. I can, however, provide the parameters 
passed to
org.apache.commons.compress.archivers.zip.X7875_NewUnix.parseFromLocalFileData(X7875_NewUnix.java:199):

{code}
data = {byte[13]@2103}
 0 = 85
 1 = 84
 2 = 5
 3 = 0
 4 = 7
 5 = -112
 6 = -108
 7 = 51
 8 = 85
 9 = 117
 10 = 120
 11 = 0
 12 = 0
offset = 13
length = 0
{code}

It seems the method tries to read more bytes than are actually available in 
the buffer.
Note that 7zip and unzip can both extract the file without even a warning, so 
the file does not appear to be corrupted.
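
For reference, the 13 bytes above can be read as ordinary ZIP extra-field 
records: a 2-byte header ID, a 2-byte data size (both little-endian), then the 
data. The following is only a minimal sketch of that decoding; the class 
ExtraFieldDump and its methods are hypothetical names used for illustration 
and are not part of Tika or Commons Compress.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Tika or Commons Compress class): splits a raw
// ZIP extra-field buffer into its (header ID, data size) records.
public class ExtraFieldDump {

    /** One extra-field record: where its data starts and how long it claims to be. */
    public static final class Record {
        public final int headerId;   // 16-bit little-endian tag
        public final int dataOffset; // index of the first data byte
        public final int dataLength; // declared size of the data

        Record(int headerId, int dataOffset, int dataLength) {
            this.headerId = headerId;
            this.dataOffset = dataOffset;
            this.dataLength = dataLength;
        }
    }

    /** Reads an unsigned 16-bit little-endian value starting at pos. */
    public static int readUInt16LE(byte[] buf, int pos) {
        return (buf[pos] & 0xFF) | ((buf[pos + 1] & 0xFF) << 8);
    }

    /** Walks the buffer record by record, rejecting records that overrun it. */
    public static List<Record> split(byte[] extra) {
        List<Record> records = new ArrayList<>();
        int pos = 0;
        while (pos + 4 <= extra.length) {
            int id = readUInt16LE(extra, pos);
            int len = readUInt16LE(extra, pos + 2);
            if (pos + 4 + len > extra.length) {
                throw new IllegalArgumentException(
                        "record at " + pos + " overruns the buffer");
            }
            records.add(new Record(id, pos + 4, len));
            pos += 4 + len;
        }
        return records;
    }

    public static void main(String[] args) {
        // The bytes from the debugger dump above.
        byte[] extra = {85, 84, 5, 0, 7, -112, -108, 51, 85, 117, 120, 0, 0};
        for (Record r : split(extra)) {
            System.out.printf("id=0x%04X offset=%d length=%d%n",
                    r.headerId, r.dataOffset, r.dataLength);
        }
        // prints:
        // id=0x5455 offset=4 length=5
        // id=0x7875 offset=13 length=0
    }
}
```

Read this way, the buffer holds an extended-timestamp record (0x5455, "UT") 
with 5 data bytes, followed by a new-Unix record (0x7875, "ux") whose declared 
data size is 0. That matches the reported offset = 13 and length = 0, so any 
read of data[offset] would fall one past the end of the 13-byte array, which 
is consistent with the ArrayIndexOutOfBoundsException: 13.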



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
