[ 
https://issues.apache.org/jira/browse/TIKA-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15968123#comment-15968123
 ] 

Thorsten Schäfer edited comment on TIKA-1631 at 4/13/17 7:37 PM:
-----------------------------------------------------------------

[~talli...@mitre.org] I believe that preventing the OOM would be worth some
effort.

Some implementation details from commons-compress have already "bubbled up"
into Tika's {{CompressionParser#getMediaType}} method. So copying the signature
checks from commons-compress would not introduce additional coupling, and it
could save some cycles because the streams would not need to be created.
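
To illustrate what such a copied signature check could look like, here is a
minimal sketch. The class and method names ({{ZSignature}}, {{looksLikeZ}}) are
hypothetical and exist in neither Tika nor commons-compress; the only fact
relied on is that .Z files start with the two magic bytes 0x1f 0x9d. The check
works on an already buffered prefix, so no compressor stream is constructed.

```java
// Hypothetical helper for illustration only; not part of Tika or commons-compress.
final class ZSignature {
    // Magic number of the .Z (Unix compress) format.
    private static final int MAGIC_1 = 0x1f;
    private static final int MAGIC_2 = 0x9d;

    private ZSignature() {
    }

    /**
     * Returns true if the buffered prefix starts with the .Z magic bytes.
     *
     * @param prefix first bytes of the file, read by the caller
     * @param length number of valid bytes in prefix
     */
    static boolean looksLikeZ(byte[] prefix, int length) {
        return prefix != null
                && length >= 2
                && prefix.length >= 2
                && (prefix[0] & 0xff) == MAGIC_1
                && (prefix[1] & 0xff) == MAGIC_2;
    }
}
```

A detector could call this on the bytes it has already buffered and report the
.Z media type directly, without allocating any decompression tables.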

Regarding your second concern, I agree that attempting to parse the file would
also lead to an OOM. But there may be clients other than me that use Tika only
to determine the MIME type and do not process the file.

commons-compress could certainly consider restricting the maximum code size to,
e.g., 16 bits by setting {{ZCompressorInputStream.MAX_CODE_SIZE_MASK = 0x10}}.
Although their implementation might be correct from an academic standpoint,
probably nobody will use .Z file compression with 32-bit code words (resulting
in a 3x2 GB table size, compared to 3x64 KB with 16 bits).
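
As a rough sketch of such a limit, the declared code size could be checked
before any table is allocated. The class {{ZHeaderCheck}}, its methods and the
16-bit limit are assumptions made for illustration, not existing
commons-compress API. The sketch relies only on the low five bits of the third
header byte holding the maximum code size, so random input can declare up to
31 bits.

```java
// Hypothetical pre-allocation check for illustration; not existing library code.
final class ZHeaderCheck {
    // Low five bits of the third .Z header byte carry the maximum code size.
    static final int CODE_SIZE_MASK = 0x1f;
    // Assumed upper bound; 16 bits is the limit proposed in this comment.
    static final int MAX_ALLOWED_CODE_SIZE = 16;

    private ZHeaderCheck() {
    }

    /** Extracts the declared maximum code size (in bits) from the third header byte. */
    static int declaredCodeSize(int thirdHeaderByte) {
        return thirdHeaderByte & CODE_SIZE_MASK;
    }

    /** Returns true if the declared code size does not exceed the assumed limit. */
    static boolean isWithinLimit(int thirdHeaderByte) {
        return declaredCodeSize(thirdHeaderByte) <= MAX_ALLOWED_CODE_SIZE;
    }

    /** Number of entries in each LZW table for a given code size in bits. */
    static long tableEntries(int codeSizeBits) {
        return 1L << codeSizeBits;
    }
}
```

With this arithmetic, a 16-bit code size gives 65,536 entries per table, while
a declared 31 bits would require 2,147,483,648 entries per table, which is the
order of magnitude behind the multi-gigabyte allocation described in the issue.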

The bottom line is that I would love it if we could do something to improve the
current situation, which can potentially lead to a JVM crash.



> OutOfMemoryException in ZipContainerDetector
> --------------------------------------------
>
>                 Key: TIKA-1631
>                 URL: https://issues.apache.org/jira/browse/TIKA-1631
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.8
>            Reporter: Pavel Micka
>         Attachments: cache.mpgindex
>
>
> When I try to detect a ZIP container, I occasionally get this exception. It
> is caused by the fact that the file looks like a ZIP container (magic bytes)
> but is in fact random noise. So commons-compress tries to read the size of
> the tables (it expects a correct stream), loads what happens to be a huge
> number (as there can be anything at that position in the stream) and tries
> to allocate an array of several GB in size (hence the exception).
> This bug negatively affects the stability of systems running Tika, as the
> decompressor can accidentally allocate as much memory as is available, and
> other parts of the system might then not be able to allocate their objects.
> A solution might be to add a parameter to the Tika config that limits the
> size of these arrays. If the requested size were bigger, an exception would
> be thrown. This change should not be hard, as the method
> InternalLZWInputStream.initializeTables() is protected.
> Exception in thread "pool-2-thread-2" java.lang.OutOfMemoryError: Java heap space
>       at org.apache.commons.compress.compressors.z._internal_.InternalLZWInputStream.initializeTables(InternalLZWInputStream.java:111)
>       at org.apache.commons.compress.compressors.z.ZCompressorInputStream.<init>(ZCompressorInputStream.java:52)
>       at org.apache.commons.compress.compressors.CompressorStreamFactory.createCompressorInputStream(CompressorStreamFactory.java:186)
>       at org.apache.tika.parser.pkg.ZipContainerDetector.detectCompressorFormat(ZipContainerDetector.java:106)
>       at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:92)
>       at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
