Caleb Postlethwait created TIKA-3152:
----------------------------------------

             Summary: Calling autoDetectParser.parse results in Unexpected RuntimeException on .msg file with large attachment.
                 Key: TIKA-3152
                 URL: https://issues.apache.org/jira/browse/TIKA-3152
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.24
         Environment: Running on Ubuntu machines in AWS Cloud
            Reporter: Caleb Postlethwait


When calling parse on a .msg file stream, I get a RuntimeException from Tika. The .msg file contains a MOV file attachment of approximately 22 MB. Unfortunately, I'm unable to share the file as it is client data. My QA team is trying to reproduce the problem with another file but hasn't had much luck so far. I can open both the .msg file and the attached MOV file in Outlook, and they seem OK.

I'm including the stack trace, the code leading up to the parse call, and the tika-config we're using.

Code Snippet:

config = TikaConfigFactory.getTikaConfig();
Parser autoDetectParser = new AutoDetectParser(config);
ParseContext context = new ParseContext();
context.set(TikaConfig.class, config);
autoDetectParser.parse(input, handler, metadata, context);

Stacktrace:

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@bdef9db
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at com.stormed.processing.TikaInfor.getInfor(TikaInfor.java:102)
    at com.stormed.processing.AbstractFileProduct.addNatural_Metadata(AbstractFileProduct.java:114)
    at com.stormed.processing.ProcessingMain.processing(ProcessingMain.java:280)
    at com.stormed.processing.ProcessingMain.<init>(ProcessingMain.java:93)
    at com.stormed.processing.common.ProcessingBuilder.run(ProcessingBuilder.java:45)
    at com.stormed.proxy.AppRunner.run(AppRunner.java:21)
    at com.stormed.proxy.ProxyMain.runApp(ProxyMain.java:228)
    at com.stormed.proxy.ProxyMain.lambda$main$0(ProxyMain.java:120)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IndexOutOfBoundsException: Block 45824 not found
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.getBlockAt(POIFSFileSystem.java:429)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.readCoreContents(POIFSFileSystem.java:362)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:316)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:123)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 14 more
Caused by: java.lang.IndexOutOfBoundsException: Unable to read 512 bytes from 23462400 in stream of length 23462400
    at org.apache.poi.poifs.nio.ByteArrayBackedDataSource.read(ByteArrayBackedDataSource.java:47)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.getBlockAt(POIFSFileSystem.java:427)
    ... 18 more

Tika Config:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <service-loader dynamic="false" loadErrorHandler="IGNORE" initializableProblemHandler="IGNORE"/>
  <encodingDetectors>
    <encodingDetector class="org.apache.tika.parser.html.HtmlEncodingDetector"/>
    <encodingDetector class="org.apache.tika.parser.txt.UniversalEncodingDetector"/>
    <encodingDetector class="org.apache.tika.parser.txt.Icu4jEncodingDetector"/>
  </encodingDetectors>
  <detectors>
    <detector class="org.apache.tika.detect.OverrideDetector"/>
    <detector class="org.apache.tika.parser.microsoft.POIFSContainerDetector"/>
    <detector class="org.apache.tika.parser.pkg.ZipContainerDetector"/>
    <detector class="org.gagravarr.tika.OggDetector"/>
    <detector class="org.apache.tika.mime.MimeTypes"/>
  </detectors>
  <parsers>
    <parser class="org.apache.tika.parser.apple.AppleSingleFileParser"/>
    <parser class="org.apache.tika.parser.asm.ClassParser"/>
    <parser class="org.apache.tika.parser.audio.AudioParser"/>
    <parser class="org.apache.tika.parser.audio.MidiParser"/>
    <parser class="org.apache.tika.parser.chm.ChmParser"/>
    <parser class="org.apache.tika.parser.code.SourceCodeParser"/>
    <parser class="org.apache.tika.parser.crypto.Pkcs7Parser"/>
    <parser class="org.apache.tika.parser.crypto.TSDParser"/>
    <parser class="org.apache.tika.parser.csv.TextAndCSVParser"/>
    <parser class="org.apache.tika.parser.dbf.DBFParser"/>
    <parser class="org.apache.tika.parser.dif.DIFParser"/>
    <parser class="org.apache.tika.parser.dwg.DWGParser"/>
    <parser class="org.apache.tika.parser.epub.EpubParser"/>
    <parser class="org.apache.tika.parser.executable.ExecutableParser"/>
    <parser class="org.apache.tika.parser.feed.FeedParser"/>
    <parser class="org.apache.tika.parser.font.AdobeFontMetricParser"/>
    <parser class="org.apache.tika.parser.font.TrueTypeParser"/>
    <parser class="org.apache.tika.parser.gdal.GDALParser"/>
    <parser class="org.apache.tika.parser.geoinfo.GeographicInformationParser"/>
    <parser class="org.apache.tika.parser.grib.GribParser"/>
    <parser class="org.apache.tika.parser.hdf.HDFParser"/>
    <parser class="org.apache.tika.parser.html.HtmlParser"/>
    <parser class="org.apache.tika.parser.hwp.HwpV5Parser"/>
    <parser class="org.apache.tika.parser.image.BPGParser"/>
    <parser class="org.apache.tika.parser.image.ICNSParser"/>
    <parser class="org.apache.tika.parser.image.ImageParser"/>
    <parser class="org.apache.tika.parser.image.PSDParser"/>
    <parser class="org.apache.tika.parser.image.TiffParser"/>
    <parser class="org.apache.tika.parser.image.WebPParser"/>
    <parser class="org.apache.tika.parser.iptc.IptcAnpaParser"/>
    <parser class="org.apache.tika.parser.isatab.ISArchiveParser"/>
    <parser class="org.apache.tika.parser.iwork.IWorkPackageParser"/>
    <parser class="org.apache.tika.parser.jdbc.SQLite3Parser"/>
    <parser class="org.apache.tika.parser.jpeg.JpegParser"/>
    <parser class="org.apache.tika.parser.mail.RFC822Parser"/>
    <parser class="org.apache.tika.parser.mat.MatParser"/>
    <parser class="org.apache.tika.parser.mbox.MboxParser"/>
    <parser class="org.apache.tika.parser.mbox.OutlookPSTParser"/>
    <parser class="org.apache.tika.parser.microsoft.EMFParser"/>
    <parser class="org.apache.tika.parser.microsoft.JackcessParser"/>
    <parser class="org.apache.tika.parser.microsoft.MSOwnerFileParser"/>
    <parser class="org.apache.tika.parser.microsoft.OfficeParser"/>
    <parser class="org.apache.tika.parser.microsoft.OldExcelParser"/>
    <parser class="org.apache.tika.parser.microsoft.TNEFParser"/>
    <parser class="org.apache.tika.parser.microsoft.WMFParser"/>
    <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
    <parser class="org.apache.tika.parser.microsoft.ooxml.xwpf.ml2006.Word2006MLParser"/>
    <parser class="org.apache.tika.parser.microsoft.xml.SpreadsheetMLParser"/>
    <parser class="org.apache.tika.parser.microsoft.xml.WordMLParser"/>
    <parser class="org.apache.tika.parser.mp3.Mp3Parser"/>
    <parser class="org.apache.tika.parser.mp4.MP4Parser"/>
    <parser class="org.apache.tika.parser.netcdf.NetCDFParser"/>
    <parser class="org.apache.tika.parser.odf.OpenDocumentParser"/>
    <parser class="org.apache.tika.parser.pdf.PDFParser"/>
    <parser class="org.apache.tika.parser.pkg.CompressorParser"/>
    <parser class="org.apache.tika.parser.pkg.PackageParser"/>
    <parser class="org.apache.tika.parser.pkg.RarParser"/>
    <parser class="org.apache.tika.parser.rtf.RTFParser"/>
    <parser class="org.apache.tika.parser.sas.SAS7BDATParser"/>
    <parser class="org.apache.tika.parser.video.FLVParser"/>
    <parser class="org.apache.tika.parser.wordperfect.QuattroProParser"/>
    <parser class="org.apache.tika.parser.wordperfect.WordPerfectParser"/>
    <parser class="org.apache.tika.parser.xliff.XLIFF12Parser"/>
    <parser class="org.apache.tika.parser.xliff.XLZParser"/>
    <parser class="org.apache.tika.parser.xml.DcXMLParser"/>
    <parser class="org.apache.tika.parser.xml.FictionBookParser"/>
    <parser class="org.gagravarr.tika.FlacParser"/>
    <parser class="org.gagravarr.tika.OggParser"/>
    <parser class="org.gagravarr.tika.OpusParser"/>
    <parser class="org.gagravarr.tika.SpeexParser"/>
    <parser class="org.gagravarr.tika.TheoraParser"/>
    <parser class="org.gagravarr.tika.VorbisParser"/>
  </parsers>
</properties>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
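
A note on the numbers in the stack trace, offered as a rough sanity check rather than a diagnosis. In the OLE2 container layout, the first sector is the header, so block n should begin at byte (n + 1) * sectorSize. The sketch below assumes 512-byte sectors (taken from the "Unable to read 512 bytes" message) and plugs in the two values from the trace. The class and method names are made up for illustration and are not part of Tika or POI.

```java
public class BlockOffsetCheck {
    // Assumption: 512-byte sectors, as suggested by the trace message.
    static final int SECTOR_SIZE = 512;

    // Byte offset at which a block starts; the +1 skips the header sector.
    static long blockOffset(long blockIndex) {
        return (blockIndex + 1) * SECTOR_SIZE;
    }

    // Number of bytes the stream is short of holding the whole block.
    static long missingBytes(long blockIndex, long streamLength) {
        long blockEnd = blockOffset(blockIndex) + SECTOR_SIZE;
        return Math.max(0L, blockEnd - streamLength);
    }

    public static void main(String[] args) {
        long missingBlock = 45824L;     // from "Block 45824 not found"
        long streamLength = 23462400L;  // from "stream of length 23462400"

        System.out.println("block starts at byte " + blockOffset(missingBlock));
        System.out.println("stream length        " + streamLength);
        System.out.println("bytes missing        " + missingBytes(missingBlock, streamLength));
    }
}
```

With these values the block would start at byte 23462400, which is exactly the reported stream length, so the data handed to POI appears to end one sector short of what the file's structure references. That might point to the input stream being cut off before it reaches Tika, or to a file whose allocation tables reference a sector past the end and which Outlook tolerates. Comparing the size of the .msg file on disk with 23462400 may help tell the two apart.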