[ https://issues.apache.org/jira/browse/TIKA-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967654#comment-15967654 ]
Thorsten Schäfer commented on TIKA-1631:
----------------------------------------

Hi, unfortunately we are also running into this bug with Tika 1.13. Do you have any plans to fix it?

Currently the {{ZipContainerDetector}} creates a CompressorInputStream for the file and uses {{CompressorParser.getMediaType}} to do an instanceof check (code snippet below). The bug could be circumvented by not creating the stream in the first place, and instead moving the signature checks from {{CompressorStreamFactory#createCompressorInputStream}} into {{ZipContainerDetector#detectCompressorFormat}}.

Current state:
{code}
private static MediaType detectCompressorFormat(byte[] prefix, int length) {
    try {
        CompressorStreamFactory factory = new CompressorStreamFactory();
        CompressorInputStream cis = factory.createCompressorInputStream(
                new ByteArrayInputStream(prefix, 0, length));
        try {
            return CompressorParser.getMediaType(cis);
        } finally {
            IOUtils.closeQuietly(cis);
        }
    } catch (CompressorException e) {
        return MediaType.OCTET_STREAM;
    }
}
{code}

> OutOfMemoryException in ZipContainerDetector
> --------------------------------------------
>
>                 Key: TIKA-1631
>                 URL: https://issues.apache.org/jira/browse/TIKA-1631
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.8
>            Reporter: Pavel Micka
>
> When I try to detect a ZIP container, I occasionally get this exception. It is caused by the fact that the file looks like a compressed container (matching magic bytes) but is in fact random noise. Apache Commons Compress expects a valid stream and tries to determine the size of its internal tables, coincidentally reads a huge number (since anything can appear at that position in the stream), and then tries to allocate an array several GB in size (hence the exception).
> This bug negatively affects the stability of systems running Tika, as the decompressor can accidentally allocate all available memory, and other parts of the system may then be unable to allocate their objects.
> A solution might be to add an additional parameter to the Tika config that would limit the size of these arrays; if the size exceeded the limit, an exception would be thrown. This change should not be hard, as the method InternalLZWInputStream.initializeTables() is protected.
>
> Exception in thread "pool-2-thread-2" java.lang.OutOfMemoryError: Java heap space
>     at org.apache.commons.compress.compressors.z._internal_.InternalLZWInputStream.initializeTables(InternalLZWInputStream.java:111)
>     at org.apache.commons.compress.compressors.z.ZCompressorInputStream.<init>(ZCompressorInputStream.java:52)
>     at org.apache.commons.compress.compressors.CompressorStreamFactory.createCompressorInputStream(CompressorStreamFactory.java:186)
>     at org.apache.tika.parser.pkg.ZipContainerDetector.detectCompressorFormat(ZipContainerDetector.java:106)
>     at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:92)
>     at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
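The workaround proposed in the comment above (checking the magic-byte signatures directly instead of constructing a CompressorInputStream) could look roughly like the following sketch. This is not Tika's actual code: the class name, the returned MIME-type strings, and the hard-coded signature list are illustrative assumptions; the real CompressorStreamFactory recognizes more formats than shown here.

```java
import java.util.Arrays;

public class CompressorSignatureSketch {

    // Well-known magic-byte prefixes for a few compressor formats.
    // (Assumption: a subset of the signatures Commons Compress checks.)
    private static final byte[] GZIP_MAGIC  = {(byte) 0x1f, (byte) 0x8b};
    private static final byte[] BZIP2_MAGIC = {'B', 'Z', 'h'};
    private static final byte[] XZ_MAGIC    = {(byte) 0xfd, '7', 'z', 'X', 'Z', 0x00};
    private static final byte[] Z_MAGIC     = {(byte) 0x1f, (byte) 0x9d};

    // Detect the format from the buffered prefix alone, never building a
    // decompressor stream -- so no table allocation can happen on noise.
    public static String detectCompressorFormat(byte[] prefix, int length) {
        if (matches(prefix, length, GZIP_MAGIC))  return "application/gzip";
        if (matches(prefix, length, BZIP2_MAGIC)) return "application/x-bzip2";
        if (matches(prefix, length, XZ_MAGIC))    return "application/x-xz";
        if (matches(prefix, length, Z_MAGIC))     return "application/x-compress";
        return "application/octet-stream";
    }

    private static boolean matches(byte[] prefix, int length, byte[] magic) {
        if (length < magic.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(prefix, magic.length), magic);
    }
}
```

Because detection here only compares a few leading bytes, a file that merely resembles a compressed stream is classified without the decompressor ever reading a (possibly garbage) table size, which is exactly the allocation path that triggers the OutOfMemoryError in the report above.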