Caleb Postlethwait created TIKA-3152:
----------------------------------------

             Summary: Calling autoDetectParser.parse results in Unexpected 
RuntimeException on .msg file with large attachment.
                 Key: TIKA-3152
                 URL: https://issues.apache.org/jira/browse/TIKA-3152
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.24
         Environment: Running on ubuntu machines in AWS Cloud
            Reporter: Caleb Postlethwait


When calling parse on a .msg file stream, I'm getting a RuntimeException from 
Tika. The .msg file contains a MOV file attachment of approximately 22 MB. 
Unfortunately, I'm unable to share the file because it is client data. My QA 
team is trying to reproduce the problem with another file but hasn't had much 
luck so far. I'm able to open the .msg file and the attached MOV file with 
Outlook, and both seem OK. I'm including the stack trace, the code leading up 
to the parse, and the tika-config we're using.

 

Code Snippet:

TikaConfig config = TikaConfigFactory.getTikaConfig();
Parser autoDetectParser = new AutoDetectParser(config);
ParseContext context = new ParseContext();
context.set(TikaConfig.class, config);
// input is the .msg file stream; handler and metadata are created beforehand
autoDetectParser.parse(input, handler, metadata, context);

 

Stacktrace:

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@bdef9db
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at com.stormed.processing.TikaInfor.getInfor(TikaInfor.java:102)
    at com.stormed.processing.AbstractFileProduct.addNatural_Metadata(AbstractFileProduct.java:114)
    at com.stormed.processing.ProcessingMain.processing(ProcessingMain.java:280)
    at com.stormed.processing.ProcessingMain.<init>(ProcessingMain.java:93)
    at com.stormed.processing.common.ProcessingBuilder.run(ProcessingBuilder.java:45)
    at com.stormed.proxy.AppRunner.run(AppRunner.java:21)
    at com.stormed.proxy.ProxyMain.runApp(ProxyMain.java:228)
    at com.stormed.proxy.ProxyMain.lambda$main$0(ProxyMain.java:120)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IndexOutOfBoundsException: Block 45824 not found
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.getBlockAt(POIFSFileSystem.java:429)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.readCoreContents(POIFSFileSystem.java:362)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:316)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:123)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 14 more
Caused by: java.lang.IndexOutOfBoundsException: Unable to read 512 bytes from 23462400 in stream of length 23462400
    at org.apache.poi.poifs.nio.ByteArrayBackedDataSource.read(ByteArrayBackedDataSource.java:47)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.getBlockAt(POIFSFileSystem.java:427)
    ... 18 more
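
As a rough sanity check on the numbers in the trace, the sketch below works out where block 45824 would start in the stream. It assumes the usual OLE2 layout in which a one-sector header precedes block 0, so block N starts at (N + 1) * blockSize. The class and method names are hypothetical and are not part of Tika or POI; the three constants are copied from the exception messages.

```java
// Sketch only: BlockOffsetCheck is a hypothetical helper, not a Tika or POI class.
final class BlockOffsetCheck {

    /** Byte offset at which a block starts, assuming block 0 follows a one-sector header. */
    static long blockStart(long blockIndex, int blockSize) {
        return (blockIndex + 1L) * blockSize;
    }

    /** True if the whole block lies inside a stream of the given length. */
    static boolean blockFits(long blockIndex, int blockSize, long streamLength) {
        return blockStart(blockIndex, blockSize) + blockSize <= streamLength;
    }

    public static void main(String[] args) {
        long block = 45824L;      // from "Block 45824 not found"
        int blockSize = 512;      // from "Unable to read 512 bytes"
        long length = 23462400L;  // from "stream of length 23462400"

        System.out.println("block start     = " + blockStart(block, blockSize));
        System.out.println("block fits      = " + blockFits(block, blockSize, length));
        // Highest block index that is fully contained in the stream.
        System.out.println("last full block = " + (length / blockSize - 2));
    }
}
```

Under that assumption, the requested block starts at byte 23462400, which is exactly the stream length, so the read begins at end of stream. That would be consistent with either a stream that ends one sector short of what the file's allocation tables expect or a reference to a block just past the end, but it does not say which.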

 

 

Tika Config:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
 <service-loader dynamic="false" loadErrorHandler="IGNORE" initializableProblemHandler="IGNORE"/>
 <encodingDetectors>
 <encodingDetector class="org.apache.tika.parser.html.HtmlEncodingDetector"/>
 <encodingDetector class="org.apache.tika.parser.txt.UniversalEncodingDetector"/>
 <encodingDetector class="org.apache.tika.parser.txt.Icu4jEncodingDetector"/>
 </encodingDetectors>
 <detectors>
 <detector class="org.apache.tika.detect.OverrideDetector"/>
 <detector class="org.apache.tika.parser.microsoft.POIFSContainerDetector"/>
 <detector class="org.apache.tika.parser.pkg.ZipContainerDetector"/>
 <detector class="org.gagravarr.tika.OggDetector"/>
 <detector class="org.apache.tika.mime.MimeTypes"/>
 </detectors>
 <parsers>
 <parser class="org.apache.tika.parser.apple.AppleSingleFileParser"/>
 <parser class="org.apache.tika.parser.asm.ClassParser"/>
 <parser class="org.apache.tika.parser.audio.AudioParser"/>
 <parser class="org.apache.tika.parser.audio.MidiParser"/>
 <parser class="org.apache.tika.parser.chm.ChmParser"/>
 <parser class="org.apache.tika.parser.code.SourceCodeParser"/>
 <parser class="org.apache.tika.parser.crypto.Pkcs7Parser"/>
 <parser class="org.apache.tika.parser.crypto.TSDParser"/>
 <parser class="org.apache.tika.parser.csv.TextAndCSVParser"/>
 <parser class="org.apache.tika.parser.dbf.DBFParser"/>
 <parser class="org.apache.tika.parser.dif.DIFParser"/>
 <parser class="org.apache.tika.parser.dwg.DWGParser"/>
 <parser class="org.apache.tika.parser.epub.EpubParser"/>
 <parser class="org.apache.tika.parser.executable.ExecutableParser"/>
 <parser class="org.apache.tika.parser.feed.FeedParser"/>
 <parser class="org.apache.tika.parser.font.AdobeFontMetricParser"/>
 <parser class="org.apache.tika.parser.font.TrueTypeParser"/>
 <parser class="org.apache.tika.parser.gdal.GDALParser"/>
 <parser class="org.apache.tika.parser.geoinfo.GeographicInformationParser"/>
 <parser class="org.apache.tika.parser.grib.GribParser"/>
 <parser class="org.apache.tika.parser.hdf.HDFParser"/>
 <parser class="org.apache.tika.parser.html.HtmlParser"/>
 <parser class="org.apache.tika.parser.hwp.HwpV5Parser"/>
 <parser class="org.apache.tika.parser.image.BPGParser"/>
 <parser class="org.apache.tika.parser.image.ICNSParser"/>
 <parser class="org.apache.tika.parser.image.ImageParser"/>
 <parser class="org.apache.tika.parser.image.PSDParser"/>
 <parser class="org.apache.tika.parser.image.TiffParser"/>
 <parser class="org.apache.tika.parser.image.WebPParser"/>
 <parser class="org.apache.tika.parser.iptc.IptcAnpaParser"/>
 <parser class="org.apache.tika.parser.isatab.ISArchiveParser"/>
 <parser class="org.apache.tika.parser.iwork.IWorkPackageParser"/>
 <parser class="org.apache.tika.parser.jdbc.SQLite3Parser"/>
 <parser class="org.apache.tika.parser.jpeg.JpegParser"/>
 <parser class="org.apache.tika.parser.mail.RFC822Parser"/>
 <parser class="org.apache.tika.parser.mat.MatParser"/>
 <parser class="org.apache.tika.parser.mbox.MboxParser"/>
 <parser class="org.apache.tika.parser.mbox.OutlookPSTParser"/>
 <parser class="org.apache.tika.parser.microsoft.EMFParser"/>
 <parser class="org.apache.tika.parser.microsoft.JackcessParser"/>
 <parser class="org.apache.tika.parser.microsoft.MSOwnerFileParser"/>
 <parser class="org.apache.tika.parser.microsoft.OfficeParser"/>
 <parser class="org.apache.tika.parser.microsoft.OldExcelParser"/>
 <parser class="org.apache.tika.parser.microsoft.TNEFParser"/>
 <parser class="org.apache.tika.parser.microsoft.WMFParser"/>
 <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
 <parser class="org.apache.tika.parser.microsoft.ooxml.xwpf.ml2006.Word2006MLParser"/>
 <parser class="org.apache.tika.parser.microsoft.xml.SpreadsheetMLParser"/>
 <parser class="org.apache.tika.parser.microsoft.xml.WordMLParser"/>
 <parser class="org.apache.tika.parser.mp3.Mp3Parser"/>
 <parser class="org.apache.tika.parser.mp4.MP4Parser"/>
 <parser class="org.apache.tika.parser.netcdf.NetCDFParser"/>
 <parser class="org.apache.tika.parser.odf.OpenDocumentParser"/>
 <parser class="org.apache.tika.parser.pdf.PDFParser"/>
 <parser class="org.apache.tika.parser.pkg.CompressorParser"/>
 <parser class="org.apache.tika.parser.pkg.PackageParser"/>
 <parser class="org.apache.tika.parser.pkg.RarParser"/>
 <parser class="org.apache.tika.parser.rtf.RTFParser"/>
 <parser class="org.apache.tika.parser.sas.SAS7BDATParser"/>
 <parser class="org.apache.tika.parser.video.FLVParser"/>
 <parser class="org.apache.tika.parser.wordperfect.QuattroProParser"/>
 <parser class="org.apache.tika.parser.wordperfect.WordPerfectParser"/>
 <parser class="org.apache.tika.parser.xliff.XLIFF12Parser"/>
 <parser class="org.apache.tika.parser.xliff.XLZParser"/>
 <parser class="org.apache.tika.parser.xml.DcXMLParser"/>
 <parser class="org.apache.tika.parser.xml.FictionBookParser"/>
 <parser class="org.gagravarr.tika.FlacParser"/>
 <parser class="org.gagravarr.tika.OggParser"/>
 <parser class="org.gagravarr.tika.OpusParser"/>
 <parser class="org.gagravarr.tika.SpeexParser"/>
 <parser class="org.gagravarr.tika.TheoraParser"/>
 <parser class="org.gagravarr.tika.VorbisParser"/>
 </parsers>
</properties>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
