[ 
https://issues.apache.org/jira/browse/COMPRESS-382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15852806#comment-15852806
 ] 

Stefan Bodewig commented on COMPRESS-382:
-----------------------------------------

LZMA is identified by a three-byte header, which is very short, so it may 
lead to quite a few false positives. I'm not sure how to avoid this without 
removing LZMA autodetection completely.

What you describe makes me think Tika could benefit from us adding a 
"does this look like a compressor stream" or even a "what kind of format do 
you think this is" method to {{CompressorStreamFactory}}.
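
As a rough illustration of the idea, here is a minimal sketch of a stricter
"does this look like LZMA" check. It is not the Commons Compress API: the
class name, the method names and the 64 MiB dictionary cap are all
hypothetical, and it assumes the three-byte signature is 0x5D 0x00 0x00 and
that the stream uses the 13-byte .lzma header (one properties byte, a 4-byte
little-endian dictionary size, an 8-byte little-endian uncompressed size).
The contents of the attached data.mui are not shown in this issue, so the
sketch only demonstrates the general failure mode: a header can match the
short signature while still declaring a dictionary far larger than the heap.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, for illustration only; not part of Commons Compress.
final class LzmaHeaderCheck {
    // Assumed upper bound for a "plausible" dictionary; tune to the heap budget.
    static final long MAX_DICT_SIZE = 64L * 1024 * 1024;

    /** The short check: only the first three bytes are compared. */
    static boolean matchesMagic(byte[] h) {
        return h.length >= 3 && h[0] == 0x5D && h[1] == 0 && h[2] == 0;
    }

    /** Stricter check that inspects the whole 13-byte .lzma header. */
    static boolean looksLikeLzma(byte[] h) {
        if (h.length < 13) {
            return false;
        }
        // Properties byte encodes lc + 9 * (lp + 5 * pb) with lc <= 8, lp <= 4, pb <= 4.
        int props = h[0] & 0xFF;
        if (props >= 9 * 5 * 5) {
            return false;
        }
        ByteBuffer buf = ByteBuffer.wrap(h, 1, 12).order(ByteOrder.LITTLE_ENDIAN);
        long dictSize = buf.getInt() & 0xFFFFFFFFL; // read as unsigned 32-bit
        long uncompressedSize = buf.getLong();      // -1 means "size unknown"
        if (dictSize > MAX_DICT_SIZE) {
            return false; // decoding would need a huge allocation; reject early
        }
        return uncompressedSize >= -1;
    }
}
```

A caller such as Tika could run a check like this on the first bytes of a
stream and skip creating the decompressor when it returns false. Any fixed cap
can also reject valid streams that really do use a larger dictionary, so this
trades false positives for possible false negatives.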

> OutOfMemoryError from CompressorStreamFactory
> ---------------------------------------------
>
>                 Key: COMPRESS-382
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-382
>             Project: Commons Compress
>          Issue Type: Bug
>          Components: Compressors
>    Affects Versions: 1.10, 1.11, 1.12
>         Environment: Windows7, jre1.8.0_101 x64
>            Reporter: Luis Filipe Nassif
>         Attachments: data.mui
>
>
> While using Tika-1.14 to detect file types, the attached 1 KB file triggered 
> an OutOfMemoryError (OOME) with a 1 GB heap. Tika calls 
> CompressorStreamFactory.createCompressorInputStream(in) to detect whether the 
> file is a compressor stream, but CompressorStreamFactory erroneously detects 
> it as an LZMA stream, and the OOME is thrown when the 
> LZMACompressorInputStream is instantiated. This error does not happen with 
> commons-compress versions prior to 1.10, which added LZMA auto-detection. 
> OOME stack trace below:
> {code}
> Caused by: java.lang.OutOfMemoryError: Java heap space
>       at org.tukaani.xz.lz.LZDecoder.<init>(Unknown Source) ~[xz-1.5.jar:1.5]
>       at org.tukaani.xz.LZMAInputStream.initialize(Unknown Source) ~[xz-1.5.jar:1.5]
>       at org.tukaani.xz.LZMAInputStream.initialize(Unknown Source) ~[xz-1.5.jar:1.5]
>       at org.tukaani.xz.LZMAInputStream.<init>(Unknown Source) ~[xz-1.5.jar:1.5]
>       at org.tukaani.xz.LZMAInputStream.<init>(Unknown Source) ~[xz-1.5.jar:1.5]
>       at org.apache.commons.compress.compressors.lzma.LZMACompressorInputStream.<init>(LZMACompressorInputStream.java:48) ~[commons-compress-1.10.jar:1.10]
>       at org.apache.commons.compress.compressors.CompressorStreamFactory.createCompressorInputStream(CompressorStreamFactory.java:251) ~[commons-compress-1.10.jar:1.10]
>       at org.apache.tika.parser.pkg.ZipContainerDetector.detectCompressorFormat(ZipContainerDetector.java:109) ~[tika-parsers-1.14.jar:1.14]
>       at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:95) ~[tika-parsers-1.14.jar:1.14]
>       at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77) ~[tika-core-1.14.jar:1.14]
>       at dpf.sp.gpinf.indexer.process.task.SignatureTask.process(SignatureTask.java:50) ~[iped.jar:?]
>       at dpf.sp.gpinf.indexer.process.task.AbstractTask.processMonitorTimeout(AbstractTask.java:203) ~[iped.jar:?]
>       at dpf.sp.gpinf.indexer.process.task.AbstractTask.processAndSendToNextTask(AbstractTask.java:152) ~[iped.jar:?]
>       at dpf.sp.gpinf.indexer.process.task.AbstractTask.sendToNextTask(AbstractTask.java:190) ~[iped.jar:?]
>       at dpf.sp.gpinf.indexer.process.task.AbstractTask.processAndSendToNextTask(AbstractTask.java:160) ~[iped.jar:?]
>       at dpf.sp.gpinf.indexer.process.task.AbstractTask.sendToNextTask(AbstractTask.java:190) ~[iped.jar:?]
>       at dpf.sp.gpinf.indexer.process.task.AbstractTask.processAndSendToNextTask(AbstractTask.java:160) ~[iped.jar:?]
>       at dpf.sp.gpinf.indexer.process.task.AbstractTask.sendToNextTask(AbstractTask.java:190) ~[iped.jar:?]
>       at dpf.sp.gpinf.indexer.process.task.AbstractTask.processAndSendToNextTask(AbstractTask.java:160) ~[iped.jar:?]
>       at dpf.sp.gpinf.indexer.process.Worker.process(Worker.java:174) ~[iped.jar:?]
>       ... 1 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
