Thank you, Yegor. Yes, I realized my initial catch was too narrow; I'll expand that generally, especially for image/pict. Thank you, again!

-----Original Message-----
From: Yegor Kozlov [mailto:[email protected]]
Sent: Saturday, November 5, 2016 10:20 AM
To: POI Developers List <[email protected]>
Subject: Re: zip exceptions in objects embedded in HSLF

Hi Tim,

Research Forum 2013.3.ppt attached to TIKA-2164 is not a valid PPT file. My PowerPoint 2013 cannot open it and displays "The selected file does not appear to be a valid Microsoft PowerPoint file.". It seems that the OLE2 filesystem in this file is invalid. POI fails before parsing the PPT data, the error happens when reading POIFS blocks:

Exception in thread "main" java.lang.IndexOutOfBoundsException: Block 24081 not found
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt(NPOIFSFileSystem.java:486)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readBAT(NPOIFSFileSystem.java:458)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readCoreContents(NPOIFSFileSystem.java:411)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:335)

In all other cases the code fails when reading data from "image/pict" objects.

I see your commit to swallow "incorrect data check" error messages. This is not to be relied on. I tried files attached to TIKA-2164 and the exception differs.

In case of Jankovic final Retreat 2002.PPT the error is java.util.zip.ZipException: invalid literal/length code.
In case of Lab Meeting.ppt the error is java.io.EOFException: Unexpected end of ZLIB input stream
In case of paperfigures.ppt the error is java.util.zip.ZipException: invalid distance too far back

Apparently we don't handle all cases when reading PICT files. I'm not sure how much is the effort to fix it, but for now can you swallow all errors for the "image/pict" content type? It is a known troublemaker and the best you can do for now is to catch all its exceptions.

Yegor

On Fri, Nov 4, 2016 at 9:25 PM, Allison, Timothy B. <[email protected]> wrote:
> And for a larger collection of zip exceptions in embedded HSLF, see
> TIKA-2164.
>
> -----Original Message-----
> From: Allison, Timothy B. [mailto:[email protected]]
> Sent: Friday, November 4, 2016 11:49 AM
> To: POI Users List <[email protected]>
> Subject: zip exceptions in objects embedded in HSLF
>
> POI Colleagues,
> On TIKA-2157 and TIKA-2130, Seva Alekseyev attached files that
> trigger a ZipException on an object embedded within a ppt. We've seen
> these in our regression corpus as well. For now, we're swallowing
> these in Tika. If anyone has a chance to look into those triggering
> files to figure out if the embedded files are truly corrupt or if this
> is something we can fix in POI, I'd appreciate it. I investigated a
> bit with TIKA-2130's file, and it _looks_ to me like the zip stream is
> truly corrupt, but this area of the code base is not one of my strengths.
> Thank you.
>
> Cheers,
>
> Tim
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
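
For readers following the thread, here is a rough sketch of the approach discussed above: catching by exception type and content type rather than matching a single error message. It is not the actual Tika or POI code. The class name `PictInflateSketch` and the helpers `inflateLeniently` and `deflate` are hypothetical, and the sketch inflates zlib data directly with `java.util.zip` instead of going through POI's PICT handling. It relies only on the fact that both `java.util.zip.ZipException` and `java.io.EOFException` are subclasses of `java.io.IOException`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class PictInflateSketch {

    /**
     * Hypothetical helper: inflates zlib-compressed bytes.
     * For the "image/pict" content type, any IOException (which covers
     * ZipException and EOFException, whatever the message) is swallowed
     * and null is returned. For other content types it is rethrown.
     */
    public static byte[] inflateLeniently(byte[] compressed, String contentType)
            throws IOException {
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            // Do not inspect e.getMessage(): the text differs from file to file.
            if ("image/pict".equals(contentType)) {
                return null;
            }
            throw e;
        }
    }

    /** Hypothetical helper used only to build sample input for the demo. */
    public static byte[] deflate(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(bos)) {
            dos.write(raw);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] good = deflate("sample embedded payload".getBytes(StandardCharsets.UTF_8));
        // Simulate a damaged stream by cutting it in half.
        byte[] truncated = Arrays.copyOf(good, good.length / 2);

        byte[] ok = inflateLeniently(good, "image/pict");
        byte[] bad = inflateLeniently(truncated, "image/pict");
        System.out.println("intact stream recovered: " + (ok != null));
        System.out.println("damaged stream skipped:  " + (bad == null));
    }
}
```

The design choice is the one Yegor suggests: because the failure surfaces with different messages and even different exception classes, filtering on the content type and the common `IOException` supertype is less brittle than string-matching on "incorrect data check".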
