[jira] [Updated] (TIKA-2217) RuntimeException on a PPT with a movie
[ https://issues.apache.org/jira/browse/TIKA-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2217:
---
    Description:
https://dl.dropboxusercontent.com/u/92341073/lecture%20WH%202002.ppt

java.lang.RuntimeException for : "Couldn't instantiate the class for type with id 1000 on class class org.apache.poi.hslf.record.Document : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 1033 on class class org.apache.poi.hslf.record.ExObjList : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 1"

java.lang.RuntimeException: Couldn't instantiate the class for type with id 1000 on class class org.apache.poi.hslf.record.Document : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 1033 on class class org.apache.poi.hslf.record.ExObjList : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 1
    at org.apache.poi.hslf.record.Record.createRecordForType:185
    at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>:161
    at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>:154
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65
    at org.apache.tika.parser.microsoft.OfficeParser.parse:172
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.reflect.InvocationTargetException:
    at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
    at java.lang.reflect.Constructor.newInstance:-1
    at org.apache.poi.hslf.record.Record.createRecordForType:181
    at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>:161
    at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>:154
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65
    at org.apache.tika.parser.microsoft.OfficeParser.parse:172
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.RuntimeException: Couldn't instantiate the class for type with id 1033 on class class org.apache.poi.hslf.record.ExObjList : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 1
    at org.apache.poi.hslf.record.Record.createRecordForType:185
    at org.apache.poi.hslf.record.Record.findChildRecords:128
    at org.apache.poi.hslf.record.Document.<init>:133
    at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
    at java.lang.reflect.Constructor.newInstance:-1
    at org.apache.poi.hslf.record.Record.createRecordForType:181
    at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>:161
    at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>:154
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65
    at org.apache.tika.parser.microsoft.OfficeParser.parse:172
    at
[jira] [Updated] (TIKA-2217) RuntimeException on a PPT with a movie
[ https://issues.apache.org/jira/browse/TIKA-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2217:
---
    Description:
https://dl.dropboxusercontent.com/u/92341073/lecture%20WH%202002.ppt

java.lang.RuntimeException: Couldn't instantiate the class for type with id 1000 on class class org.apache.poi.hslf.record.Document : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 1033 on class class org.apache.poi.hslf.record.ExObjList : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 1
    at org.apache.poi.hslf.record.Record.createRecordForType:185
    at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>:161
    at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>:154
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65
    at org.apache.tika.parser.microsoft.OfficeParser.parse:172
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.reflect.InvocationTargetException:
    at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
    at java.lang.reflect.Constructor.newInstance:-1
    at org.apache.poi.hslf.record.Record.createRecordForType:181
    at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>:161
    at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>:154
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65
    at org.apache.tika.parser.microsoft.OfficeParser.parse:172
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.RuntimeException: Couldn't instantiate the class for type with id 1033 on class class org.apache.poi.hslf.record.ExObjList : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 1
    at org.apache.poi.hslf.record.Record.createRecordForType:185
    at org.apache.poi.hslf.record.Record.findChildRecords:128
    at org.apache.poi.hslf.record.Document.<init>:133
    at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
    at java.lang.reflect.Constructor.newInstance:-1
    at org.apache.poi.hslf.record.Record.createRecordForType:181
    at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>:161
    at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>:154
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65
    at org.apache.tika.parser.microsoft.OfficeParser.parse:172
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.reflect.InvocationTargetException:
    at sun.reflect.GeneratedConstructorAccessor47.newInstance:-1
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
    at java.lang.reflect.Constructor.newInstance:-1
    at org.apache.poi.hslf.record.Record.createRecordForType:181
    at org.apache.poi.hslf.record.Record.findChildRecords:128
    at org.apache.poi.hslf.record.Document.<init>:133
    at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
    at java.lang.reflect.Constructor.newInstance:-1
    at org.apache.poi.hslf.record.Record.createRecordForType:181
    at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
    at
[jira] [Updated] (TIKA-2216) ArrayIndexOutOfBoundsException on a valid Word file
[ https://issues.apache.org/jira/browse/TIKA-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2216:
---
    Description:
https://dl.dropboxusercontent.com/u/92341073/lecture%20WH%202002.ppt

java.lang.ArrayIndexOutOfBoundsException:
    at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
    at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
    at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
    at org.apache.poi.hwpf.HWPFOldDocument.<init>:132
    at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
    at org.apache.tika.parser.microsoft.WordExtractor.parse:153
    at org.apache.tika.parser.microsoft.OfficeParser.parse:169
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130

  was:
java.lang.ArrayIndexOutOfBoundsException:
    at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
    at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
    at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
    at org.apache.poi.hwpf.HWPFOldDocument.<init>:132
    at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
    at org.apache.tika.parser.microsoft.WordExtractor.parse:153
    at org.apache.tika.parser.microsoft.OfficeParser.parse:169
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130

> ArrayIndexOutOfBoundsException on a valid Word file
> ---
>
>     Key: TIKA-2216
>     URL: https://issues.apache.org/jira/browse/TIKA-2216
>     Project: Tika
>     Issue Type: Bug
>     Components: parser
>     Affects Versions: 1.14
>     Environment: Windows 7 x64, JVM 1.8.0_101
>     Reporter: Seva Alekseyev
>     Attachments: TB Coord RFCb.doc
>
> https://dl.dropboxusercontent.com/u/92341073/lecture%20WH%202002.ppt
> java.lang.ArrayIndexOutOfBoundsException:
>     at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
>     at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
>     at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
>     at org.apache.poi.hwpf.HWPFOldDocument.<init>:132
>     at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
>     at org.apache.tika.parser.microsoft.WordExtractor.parse:153
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:169
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:130

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (TIKA-2216) ArrayIndexOutOfBoundsException on a valid Word file
[ https://issues.apache.org/jira/browse/TIKA-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2216:
---
    Description:
java.lang.ArrayIndexOutOfBoundsException:
    at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
    at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
    at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
    at org.apache.poi.hwpf.HWPFOldDocument.<init>:132
    at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
    at org.apache.tika.parser.microsoft.WordExtractor.parse:153
    at org.apache.tika.parser.microsoft.OfficeParser.parse:169
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130

  was:
https://dl.dropboxusercontent.com/u/92341073/lecture%20WH%202002.ppt

java.lang.ArrayIndexOutOfBoundsException:
    at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
    at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
    at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
    at org.apache.poi.hwpf.HWPFOldDocument.<init>:132
    at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
    at org.apache.tika.parser.microsoft.WordExtractor.parse:153
    at org.apache.tika.parser.microsoft.OfficeParser.parse:169
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130

> ArrayIndexOutOfBoundsException on a valid Word file
> ---
>
>     Key: TIKA-2216
>     URL: https://issues.apache.org/jira/browse/TIKA-2216
>     Project: Tika
>     Issue Type: Bug
>     Components: parser
>     Affects Versions: 1.14
>     Environment: Windows 7 x64, JVM 1.8.0_101
>     Reporter: Seva Alekseyev
>     Attachments: TB Coord RFCb.doc
>
> java.lang.ArrayIndexOutOfBoundsException:
>     at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
>     at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
>     at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
>     at org.apache.poi.hwpf.HWPFOldDocument.<init>:132
>     at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
>     at org.apache.tika.parser.microsoft.WordExtractor.parse:153
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:169
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:130
[jira] [Created] (TIKA-2217) RuntimeException on a PPT with a movie
Seva Alekseyev created TIKA-2217:

    Summary: RuntimeException on a PPT with a movie
    Key: TIKA-2217
    URL: https://issues.apache.org/jira/browse/TIKA-2217
    Project: Tika
    Issue Type: Bug
    Components: parser
    Affects Versions: 1.14
    Environment: Windows 7 x64, JVM 1.8.0_101
    Reporter: Seva Alekseyev

java.lang.RuntimeException for 63933/<\\ai-storm\FScan\Scan_2016-12-16_01-06-55\Folders\75457622\lecture WH 2002.ppt>: "Couldn't instantiate the class for type with id 1000 on class class org.apache.poi.hslf.record.Document : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 1033 on class class org.apache.poi.hslf.record.ExObjList : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 1"

java.lang.RuntimeException: Couldn't instantiate the class for type with id 1000 on class class org.apache.poi.hslf.record.Document : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 1033 on class class org.apache.poi.hslf.record.ExObjList : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 1
    at org.apache.poi.hslf.record.Record.createRecordForType:185
    at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>:161
    at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>:154
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65
    at org.apache.tika.parser.microsoft.OfficeParser.parse:172
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.reflect.InvocationTargetException:
    at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
    at java.lang.reflect.Constructor.newInstance:-1
    at org.apache.poi.hslf.record.Record.createRecordForType:181
    at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>:161
    at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>:154
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65
    at org.apache.tika.parser.microsoft.OfficeParser.parse:172
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.RuntimeException: Couldn't instantiate the class for type with id 1033 on class class org.apache.poi.hslf.record.ExObjList : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 1
    at org.apache.poi.hslf.record.Record.createRecordForType:185
    at org.apache.poi.hslf.record.Record.findChildRecords:128
    at org.apache.poi.hslf.record.Document.<init>:133
    at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
    at java.lang.reflect.Constructor.newInstance:-1
    at org.apache.poi.hslf.record.Record.createRecordForType:181
    at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
    at
[jira] [Updated] (TIKA-2216) ArrayIndexOutOfBoundsException on a valid Word file
[ https://issues.apache.org/jira/browse/TIKA-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2216:
---
    Attachment: TB Coord RFCb.doc

> ArrayIndexOutOfBoundsException on a valid Word file
> ---
>
>     Key: TIKA-2216
>     URL: https://issues.apache.org/jira/browse/TIKA-2216
>     Project: Tika
>     Issue Type: Bug
>     Components: parser
>     Affects Versions: 1.14
>     Environment: Windows 7 x64, JVM 1.8.0_101
>     Reporter: Seva Alekseyev
>     Attachments: TB Coord RFCb.doc
>
> java.lang.ArrayIndexOutOfBoundsException:
>     at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
>     at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
>     at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
>     at org.apache.poi.hwpf.HWPFOldDocument.<init>:132
>     at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
>     at org.apache.tika.parser.microsoft.WordExtractor.parse:153
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:169
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:130
[jira] [Created] (TIKA-2216) ArrayIndexOutOfBoundsException on a valid Word file
Seva Alekseyev created TIKA-2216:

    Summary: ArrayIndexOutOfBoundsException on a valid Word file
    Key: TIKA-2216
    URL: https://issues.apache.org/jira/browse/TIKA-2216
    Project: Tika
    Issue Type: Bug
    Components: parser
    Affects Versions: 1.14
    Environment: Windows 7 x64, JVM 1.8.0_101
    Reporter: Seva Alekseyev

java.lang.ArrayIndexOutOfBoundsException:
    at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
    at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
    at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
    at org.apache.poi.hwpf.HWPFOldDocument.<init>:132
    at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
    at org.apache.tika.parser.microsoft.WordExtractor.parse:153
    at org.apache.tika.parser.microsoft.OfficeParser.parse:169
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130
[jira] [Created] (TIKA-2215) TikaException about "Invalid embedded resource" on a valid PPT file
Seva Alekseyev created TIKA-2215:

    Summary: TikaException about "Invalid embedded resource" on a valid PPT file
    Key: TIKA-2215
    URL: https://issues.apache.org/jira/browse/TIKA-2215
    Project: Tika
    Issue Type: Bug
    Components: parser
    Affects Versions: 1.14
    Environment: Windows 7 x64, JVM 1.8.0_101
    Reporter: Seva Alekseyev
    Attachments: Iverson.ppt

On the attached file, which opens with PowerPoint, the Tika parser throws the following error:

org.apache.tika.exception.TikaException: Invalid embedded resource
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc:243
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources:390
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse:142
    at org.apache.tika.parser.microsoft.OfficeParser.parse:172
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.IndexOutOfBoundsException: Block 32630271 not found
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt:486
    at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:169
    at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:142
    at org.apache.poi.poifs.filesystem.NDocumentInputStream.readFully:248
    at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:165
    at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:160
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc:226
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources:390
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse:142
    at org.apache.tika.parser.microsoft.OfficeParser.parse:172
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.IndexOutOfBoundsException: Unable to read 512 bytes from 16706699264 in stream of length 164352
    at org.apache.poi.poifs.nio.ByteArrayBackedDataSource.read:42
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt:484
    at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:169
    at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:142
    at org.apache.poi.poifs.filesystem.NDocumentInputStream.readFully:248
    at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:165
    at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:160
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc:226
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources:390
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse:142
    at org.apache.tika.parser.microsoft.OfficeParser.parse:172
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130
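
The two IndexOutOfBoundsException messages in the TIKA-2215 trace appear to be arithmetically consistent with each other. The following is a minimal sketch of that check, not POI's actual code. It assumes 512-byte blocks (the size named in the message) and assumes that block n starts at byte (n + 1) * 512 because a header occupies the first block. The class and method names are hypothetical.

```java
/** Hypothetical helper, not part of POI or Tika: checks the numbers quoted in the trace. */
public class BlockOffsetCheck {

    /** Byte offset of a block, assuming one header block precedes block 0. */
    public static long offsetOf(long blockIndex, int blockSize) {
        return (blockIndex + 1) * blockSize;
    }

    /** True if a whole block starting at offset lies inside a stream of the given length. */
    public static boolean fits(long offset, int blockSize, long streamLength) {
        return offset >= 0 && offset + blockSize <= streamLength;
    }

    public static void main(String[] args) {
        long block = 32630271L;       // from "Block 32630271 not found"
        int blockSize = 512;          // from "Unable to read 512 bytes"
        long streamLength = 164352L;  // from "in stream of length 164352"

        long offset = offsetOf(block, blockSize);
        System.out.println(offset);                                 // 16706699264
        System.out.println(fits(offset, blockSize, streamLength));  // false
    }
}
```

Under these assumptions the computed offset equals the 16706699264 reported in the second message, while a stream of 164352 bytes holds only 321 blocks of 512 bytes, so the referenced block index lies far outside the embedded stream.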
[jira] [Updated] (TIKA-2215) TikaException about "Invalid embedded resource" on a valid PPT file
[ https://issues.apache.org/jira/browse/TIKA-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2215:
---
    Attachment: Iverson.ppt

> TikaException about "Invalid embedded resource" on a valid PPT file
> ---
>
>     Key: TIKA-2215
>     URL: https://issues.apache.org/jira/browse/TIKA-2215
>     Project: Tika
>     Issue Type: Bug
>     Components: parser
>     Affects Versions: 1.14
>     Environment: Windows 7 x64, JVM 1.8.0_101
>     Reporter: Seva Alekseyev
>     Attachments: Iverson.ppt
>
> On the attached file, which opens with PowerPoint, the Tika parser throws the following error:
> org.apache.tika.exception.TikaException: Invalid embedded resource
>     at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc:243
>     at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources:390
>     at org.apache.tika.parser.microsoft.HSLFExtractor.parse:142
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:172
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:130
> Caused by: java.lang.IndexOutOfBoundsException: Block 32630271 not found
>     at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt:486
>     at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:169
>     at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:142
>     at org.apache.poi.poifs.filesystem.NDocumentInputStream.readFully:248
>     at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:165
>     at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:160
>     at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc:226
>     at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources:390
>     at org.apache.tika.parser.microsoft.HSLFExtractor.parse:142
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:172
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:130
> Caused by: java.lang.IndexOutOfBoundsException: Unable to read 512 bytes from 16706699264 in stream of length 164352
>     at org.apache.poi.poifs.nio.ByteArrayBackedDataSource.read:42
>     at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt:484
>     at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:169
>     at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:142
>     at org.apache.poi.poifs.filesystem.NDocumentInputStream.readFully:248
>     at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:165
>     at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:160
>     at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc:226
>     at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources:390
>     at org.apache.tika.parser.microsoft.HSLFExtractor.parse:142
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:172
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:130
[jira] [Created] (TIKA-2214) ArrayIndexOutOfBoundsException on a valid Word file
Seva Alekseyev created TIKA-2214:

    Summary: ArrayIndexOutOfBoundsException on a valid Word file
    Key: TIKA-2214
    URL: https://issues.apache.org/jira/browse/TIKA-2214
    Project: Tika
    Issue Type: Bug
    Components: parser
    Affects Versions: 1.14
    Environment: Windows 7 x64, JVM 1.8.0_101
    Reporter: Seva Alekseyev
    Attachments: NONCONT.DOC

On the attached file, which opens with Word, the Tika parser throws the following error:

java.lang.ArrayIndexOutOfBoundsException:
    at java.lang.System.arraycopy:-2
    at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl:171
    at org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>:101
    at org.apache.poi.hwpf.model.OldPAPBinTable.<init>:49
    at org.apache.poi.hwpf.HWPFOldDocument.<init>:105
    at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
    at org.apache.tika.parser.microsoft.WordExtractor.parse:153
    at org.apache.tika.parser.microsoft.OfficeParser.parse:169
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130
[jira] [Updated] (TIKA-2214) ArrayIndexOutOfBoundsException on a valid Word file
[ https://issues.apache.org/jira/browse/TIKA-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2214:
---
    Attachment: NONCONT.DOC

> ArrayIndexOutOfBoundsException on a valid Word file
> ---
>
>     Key: TIKA-2214
>     URL: https://issues.apache.org/jira/browse/TIKA-2214
>     Project: Tika
>     Issue Type: Bug
>     Components: parser
>     Affects Versions: 1.14
>     Environment: Windows 7 x64, JVM 1.8.0_101
>     Reporter: Seva Alekseyev
>     Attachments: NONCONT.DOC
>
> On the attached file, which opens with Word, the Tika parser throws the following error:
> java.lang.ArrayIndexOutOfBoundsException:
>     at java.lang.System.arraycopy:-2
>     at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl:171
>     at org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>:101
>     at org.apache.poi.hwpf.model.OldPAPBinTable.<init>:49
>     at org.apache.poi.hwpf.HWPFOldDocument.<init>:105
>     at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
>     at org.apache.tika.parser.microsoft.WordExtractor.parse:153
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:169
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:130
[jira] [Updated] (TIKA-2213) ArrayIndexOutOfBoundsException on a valid Word file
[ https://issues.apache.org/jira/browse/TIKA-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2213:
---
    Attachment: biennial - 96.doc

> ArrayIndexOutOfBoundsException on a valid Word file
> ---
>
>     Key: TIKA-2213
>     URL: https://issues.apache.org/jira/browse/TIKA-2213
>     Project: Tika
>     Issue Type: Bug
>     Components: parser
>     Affects Versions: 1.14
>     Environment: Windows 7 x64, JVM 1.8.0_101
>     Reporter: Seva Alekseyev
>     Attachments: biennial - 96.doc
>
> On the attached file, which opens in Word, Tika parser throws the following error:
> java.lang.ArrayIndexOutOfBoundsException:
>     at java.lang.System.arraycopy:-2
>     at org.apache.poi.hwpf.model.TextPieceTable.<init>:109
>     at org.apache.poi.hwpf.model.ComplexFileTable.<init>:70
>     at org.apache.poi.hwpf.HWPFOldDocument.<init>:68
>     at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
>     at org.apache.tika.parser.microsoft.WordExtractor.parse:153
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:169
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:130
[jira] [Created] (TIKA-2213) ArrayIndexOutOfBoundsException on a valid Word file
Seva Alekseyev created TIKA-2213:

    Summary: ArrayIndexOutOfBoundsException on a valid Word file
    Key: TIKA-2213
    URL: https://issues.apache.org/jira/browse/TIKA-2213
    Project: Tika
    Issue Type: Bug
    Components: parser
    Affects Versions: 1.14
    Environment: Windows 7 x64, JVM 1.8.0_101
    Reporter: Seva Alekseyev

On the attached file, which opens in Word, Tika parser throws the following error:

java.lang.ArrayIndexOutOfBoundsException:
    at java.lang.System.arraycopy:-2
    at org.apache.poi.hwpf.model.TextPieceTable.<init>:109
    at org.apache.poi.hwpf.model.ComplexFileTable.<init>:70
    at org.apache.poi.hwpf.HWPFOldDocument.<init>:68
    at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
    at org.apache.tika.parser.microsoft.WordExtractor.parse:153
    at org.apache.tika.parser.microsoft.OfficeParser.parse:169
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130
[jira] [Created] (TIKA-2207) ArrayIndexOutOfBoundsException on a valid Excel file
Seva Alekseyev created TIKA-2207:

    Summary: ArrayIndexOutOfBoundsException on a valid Excel file
    Key: TIKA-2207
    URL: https://issues.apache.org/jira/browse/TIKA-2207
    Project: Tika
    Issue Type: Bug
    Components: parser
    Affects Versions: 1.14
    Environment: Windows 7 x64, JVM 1.8.0_101
    Reporter: Seva Alekseyev
    Attachments: Merck 9333 MPS 9-22-16.xlsx

The attached file, which opens in Excel, errors out in Tika:

java.lang.ArrayIndexOutOfBoundsException: 32
    at org.apache.commons.compress.compressors.lzw.LZWInputStream.initializeTables:126
    at org.apache.commons.compress.compressors.z.ZCompressorInputStream.<init>:54
    at org.apache.commons.compress.compressors.CompressorStreamFactory.createCompressorInputStream:237
    at org.apache.tika.parser.pkg.ZipContainerDetector.detectCompressorFormat:109
    at org.apache.tika.parser.pkg.ZipContainerDetector.detect:95
    at org.apache.tika.detect.CompositeDetector.detect:77
    at org.apache.tika.parser.AutoDetectParser.parse:112
    at org.apache.tika.parser.DelegatingParser.parse:72
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded:102
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE:245
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts:197
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML:115
    at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML:105
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:112
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87
[jira] [Updated] (TIKA-2207) ArrayIndexOutOfBoundsException on a valid Excel file
[ https://issues.apache.org/jira/browse/TIKA-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seva Alekseyev updated TIKA-2207: - Attachment: Merck 9333 MPS 9-22-16.xlsx > ArrayIndexOutOfBoundsException on a valid Excel file > > > Key: TIKA-2207 > URL: https://issues.apache.org/jira/browse/TIKA-2207 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 > Environment: Windows 7 x64, JVM 1.8.0_101 >Reporter: Seva Alekseyev > Attachments: Merck 9333 MPS 9-22-16.xlsx > > > The attached file, which opens in Excel, errors out in Tika: > java.lang.ArrayIndexOutOfBoundsException: 32 > at > org.apache.commons.compress.compressors.lzw.LZWInputStream.initializeTables:126 > at > org.apache.commons.compress.compressors.z.ZCompressorInputStream.:54 > at > org.apache.commons.compress.compressors.CompressorStreamFactory.createCompressorInputStream:237 > at > org.apache.tika.parser.pkg.ZipContainerDetector.detectCompressorFormat:109 > at org.apache.tika.parser.pkg.ZipContainerDetector.detect:95 > at org.apache.tika.detect.CompositeDetector.detect:77 > at org.apache.tika.parser.AutoDetectParser.parse:112 > at org.apache.tika.parser.DelegatingParser.parse:72 > at > org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded:102 > at > org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE:245 > at > org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts:197 > at > org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML:115 > at > org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML:105 > at > org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:112 > at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-2206) RecordFormatException on a valid Excel file
[ https://issues.apache.org/jira/browse/TIKA-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seva Alekseyev updated TIKA-2206: - Attachment: Budget_storyboard_V2_06282013.xls > RecordFormatException on a valid Excel file > --- > > Key: TIKA-2206 > URL: https://issues.apache.org/jira/browse/TIKA-2206 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 > Environment: Windows 7 x64, JVM 1.8.0_101 >Reporter: Seva Alekseyev > Attachments: Budget_storyboard_V2_06282013.xls > > > The attached file, which opens fine in Excel, errors out in Tika: > org.apache.poi.hssf.record.RecordFormatException for > 63773/<\\ai-storm\FScan\Scan_2016-12-11_11-14-13\Folders\70664525\Budget_storyboard_V2_06282013.xls>: > "Leftover 3 bytes in subrecord data [15, 00, 12, 00, 12, 00, 3E, 00, 11, 20, > 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 0C, 00, 14, 00, 00, 00, 00, > 00, 00, 00, 00, 00, 00, 00, 01, 00, 04, 00, 00, 00, 10, 00, 01, 00, 13, 00, > EE, 1F, 12, 00, 0B, 00, 00, 00, 00, 00, 3B, 00, 00, 00, 00, 02, 00, 00, 00, > 00, 00, 00, 03, 00, 00, 00, 18, 00, 00, 00, 00, 01, 00]" > org.apache.poi.hssf.record.RecordFormatException: Leftover 3 bytes in > subrecord data [15, 00, 12, 00, 12, 00, 3E, 00, 11, 20, 00, 00, 00, 00, 00, > 00, 00, 00, 00, 00, 00, 00, 0C, 00, 14, 00, 00, 00, 00, 00, 00, 00, 00, 00, > 00, 00, 01, 00, 04, 00, 00, 00, 10, 00, 01, 00, 13, 00, EE, 1F, 12, 00, 0B, > 00, 00, 00, 00, 00, 3B, 00, 00, 00, 00, 02, 00, 00, 00, 00, 00, 00, 03, 00, > 00, 00, 18, 00, 00, 00, 00, 01, 00] > at org.apache.poi.hssf.record.ObjRecord.:108 > at sun.reflect.GeneratedConstructorAccessor14.newInstance:-1 > at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1 > at java.lang.reflect.Constructor.newInstance:-1 > at > org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create:84 > at org.apache.poi.hssf.record.RecordFactory.createSingleRecord:345 > at > 
org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord:307 > at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord:273 > at > org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:175 > at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136 > at > org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312 > at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169 > at org.apache.tika.parser.microsoft.OfficeParser.parse:177 > at org.apache.tika.parser.microsoft.OfficeParser.parse:130 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-2206) RecordFormatException on a valid Excel file
Seva Alekseyev created TIKA-2206: Summary: RecordFormatException on a valid Excel file Key: TIKA-2206 URL: https://issues.apache.org/jira/browse/TIKA-2206 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.14 Environment: Windows 7 x64, JVM 1.8.0_101 Reporter: Seva Alekseyev The attached file, which opens fine in Excel, errors out in Tika: org.apache.poi.hssf.record.RecordFormatException for 63773/<\\ai-storm\FScan\Scan_2016-12-11_11-14-13\Folders\70664525\Budget_storyboard_V2_06282013.xls>: "Leftover 3 bytes in subrecord data [15, 00, 12, 00, 12, 00, 3E, 00, 11, 20, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 0C, 00, 14, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 01, 00, 04, 00, 00, 00, 10, 00, 01, 00, 13, 00, EE, 1F, 12, 00, 0B, 00, 00, 00, 00, 00, 3B, 00, 00, 00, 00, 02, 00, 00, 00, 00, 00, 00, 03, 00, 00, 00, 18, 00, 00, 00, 00, 01, 00]" org.apache.poi.hssf.record.RecordFormatException: Leftover 3 bytes in subrecord data [15, 00, 12, 00, 12, 00, 3E, 00, 11, 20, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 0C, 00, 14, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 01, 00, 04, 00, 00, 00, 10, 00, 01, 00, 13, 00, EE, 1F, 12, 00, 0B, 00, 00, 00, 00, 00, 3B, 00, 00, 00, 00, 02, 00, 00, 00, 00, 00, 00, 03, 00, 00, 00, 18, 00, 00, 00, 00, 01, 00] at org.apache.poi.hssf.record.ObjRecord.:108 at sun.reflect.GeneratedConstructorAccessor14.newInstance:-1 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1 at java.lang.reflect.Constructor.newInstance:-1 at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create:84 at org.apache.poi.hssf.record.RecordFactory.createSingleRecord:345 at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord:307 at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord:273 at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:175 at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136 at 
org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312 at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169 at org.apache.tika.parser.microsoft.OfficeParser.parse:177 at org.apache.tika.parser.microsoft.OfficeParser.parse:130 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-2205) IllegalArgumentException on a valid Excel file
[ https://issues.apache.org/jira/browse/TIKA-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seva Alekseyev updated TIKA-2205: - Attachment: SAT19-11-25-09_Selected Dates.xls > IllegalArgumentException on a valid Excel file > -- > > Key: TIKA-2205 > URL: https://issues.apache.org/jira/browse/TIKA-2205 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 > Environment: Windows 7 x64, JVM 1.8.0_101 >Reporter: Seva Alekseyev > Attachments: SAT19-11-25-09_Selected Dates.xls > > > The attached file, which opens in Excel, errors out in Tika: > java.lang.IllegalArgumentException: Cannot format given Object as a Number > at java.text.DecimalFormat.format:-1 > at java.text.Format.format:-1 > at org.apache.poi.ss.usermodel.DataFormatter.performDateFormatting:736 > at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:804 > at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:785 > at > org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell:143 > at > org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.formatNumberDateCell:633 > at > org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord:432 > at > org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord:336 > at > org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord:92 > at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord:109 > at > org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:179 > at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136 > at > org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312 > at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169 > at org.apache.tika.parser.microsoft.OfficeParser.parse:177 > at org.apache.tika.parser.microsoft.OfficeParser.parse:130 > at 
gov.nih.niaid.fscanner.Extract.ExtractContents:69 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-2205) IllegalArgumentException on a valid Excel file
Seva Alekseyev created TIKA-2205: Summary: IllegalArgumentException on a valid Excel file Key: TIKA-2205 URL: https://issues.apache.org/jira/browse/TIKA-2205 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.14 Environment: Windows 7 x64, JVM 1.8.0_101 Reporter: Seva Alekseyev The attached file, which opens in Excel, errors out in Tika: java.lang.IllegalArgumentException: Cannot format given Object as a Number at java.text.DecimalFormat.format:-1 at java.text.Format.format:-1 at org.apache.poi.ss.usermodel.DataFormatter.performDateFormatting:736 at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:804 at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:785 at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell:143 at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.formatNumberDateCell:633 at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord:432 at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord:336 at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord:92 at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord:109 at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:179 at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136 at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312 at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169 at org.apache.tika.parser.microsoft.OfficeParser.parse:177 at org.apache.tika.parser.microsoft.OfficeParser.parse:130 at gov.nih.niaid.fscanner.Extract.ExtractContents:69 org.apache.tika.exception.TikaException for 63269/<\\ai-storm\FScan\Scan_2016-12-11_11-14-13\Folders\51541330\engelAPBD copy.pptx>: "Error creating OOXML extractor" org.apache.tika.exception.TikaException: Error creating OOXML extractor at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:120 at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
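The top frames of the TIKA-2205 trace (`java.text.Format.format` calling `java.text.DecimalFormat.format`) match what the JDK does when a `DecimalFormat` is handed an object that is not a `Number`. The sketch below reproduces only that JDK behaviour. The class and method names are hypothetical, and the report does not say which object POI's `DataFormatter` actually passed, so treat this as one plausible mechanism rather than a diagnosis.

```java
import java.text.DecimalFormat;
import java.text.Format;

// Hypothetical illustration: not POI or Tika code.
final class FormatNonNumberSketch {

    /**
     * Formats {@code value} through the generic Format.format(Object) entry point.
     * Returns the exception message if the formatter rejects the value, or null on success.
     */
    public static String formatNonNumber(Object value) {
        Format format = new DecimalFormat("0.00");
        try {
            format.format(value);
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // A String is not a Number, so DecimalFormat refuses it.
        System.out.println(formatNonNumber("not a number"));
        // A Double is accepted, so no message is returned.
        System.out.println(formatNonNumber(12.5));
    }
}
```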
[jira] [Updated] (TIKA-2205) IllegalArgumentException on a valid Excel file
[ https://issues.apache.org/jira/browse/TIKA-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seva Alekseyev updated TIKA-2205: - Description: The attached file, which opens in Excel, errors out in Tika: java.lang.IllegalArgumentException: Cannot format given Object as a Number at java.text.DecimalFormat.format:-1 at java.text.Format.format:-1 at org.apache.poi.ss.usermodel.DataFormatter.performDateFormatting:736 at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:804 at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:785 at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell:143 at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.formatNumberDateCell:633 at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord:432 at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord:336 at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord:92 at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord:109 at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:179 at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136 at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312 at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169 at org.apache.tika.parser.microsoft.OfficeParser.parse:177 at org.apache.tika.parser.microsoft.OfficeParser.parse:130 at gov.nih.niaid.fscanner.Extract.ExtractContents:69 was: The attached file, which opens in Excel, errors out in Tika: java.lang.IllegalArgumentException: Cannot format given Object as a Number at java.text.DecimalFormat.format:-1 at java.text.Format.format:-1 at org.apache.poi.ss.usermodel.DataFormatter.performDateFormatting:736 at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:804 at 
org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:785 at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell:143 at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.formatNumberDateCell:633 at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord:432 at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord:336 at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord:92 at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord:109 at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:179 at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136 at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312 at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169 at org.apache.tika.parser.microsoft.OfficeParser.parse:177 at org.apache.tika.parser.microsoft.OfficeParser.parse:130 at gov.nih.niaid.fscanner.Extract.ExtractContents:69 org.apache.tika.exception.TikaException for 63269/<\\ai-storm\FScan\Scan_2016-12-11_11-14-13\Folders\51541330\engelAPBD copy.pptx>: "Error creating OOXML extractor" org.apache.tika.exception.TikaException: Error creating OOXML extractor at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:120 at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87 > IllegalArgumentException on a valid Excel file > -- > > Key: TIKA-2205 > URL: https://issues.apache.org/jira/browse/TIKA-2205 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 > Environment: Windows 7 x64, JVM 1.8.0_101 >Reporter: Seva Alekseyev > Attachments: SAT19-11-25-09_Selected Dates.xls > > > The attached file, which opens in Excel, errors out in Tika: > java.lang.IllegalArgumentException: Cannot format given Object as a Number > at java.text.DecimalFormat.format:-1 > at java.text.Format.format:-1 > 
at org.apache.poi.ss.usermodel.DataFormatter.performDateFormatting:736 > at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:804 > at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:785 > at > org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell:143 > at > org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.formatNumberDateCell:633 > at >
[jira] [Created] (TIKA-2204) IndexOutOfBoundsException on a valid Powerpoint file
Seva Alekseyev created TIKA-2204: Summary: IndexOutOfBoundsException on a valid Powerpoint file Key: TIKA-2204 URL: https://issues.apache.org/jira/browse/TIKA-2204 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.14 Environment: Windows 7 x64, JVM 1.8.0_101 Reporter: Seva Alekseyev Attachments: 061511.pptx The attached file, which opens in Powerpoint, errors in Tika: java.lang.IndexOutOfBoundsException: Block 733 not found at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt:486 at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readCoreContents:449 at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.:335 at org.apache.poi.poifs.filesystem.POIFSFileSystem.:87 at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE:226 at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts:197 at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML:115 at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:112 at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-2204) IndexOutOfBoundsException on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seva Alekseyev updated TIKA-2204: - Attachment: 061511.pptx > IndexOutOfBoundsException on a valid Powerpoint file > > > Key: TIKA-2204 > URL: https://issues.apache.org/jira/browse/TIKA-2204 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 > Environment: Windows 7 x64, JVM 1.8.0_101 >Reporter: Seva Alekseyev > Attachments: 061511.pptx > > > The attached file, which opens in Powerpoint, errors in Tika: > java.lang.IndexOutOfBoundsException: Block 733 not found > at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt:486 > at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readCoreContents:449 > at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.:335 > at org.apache.poi.poifs.filesystem.POIFSFileSystem.:87 > at > org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE:226 > at > org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts:197 > at > org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML:115 > at > org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:112 > at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-2203) InvalidOperationException on a valid Word file
Seva Alekseyev created TIKA-2203: Summary: InvalidOperationException on a valid Word file Key: TIKA-2203 URL: https://issues.apache.org/jira/browse/TIKA-2203 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.14 Environment: Windows 7 x64, JVM 1.8.0_101 Reporter: Seva Alekseyev Attachments: OPCCompliance_DerivedPartNameFAIL.docx The attached Word file, which opens in Word, errors out in Tika: org.apache.tika.exception.TikaException: Error creating OOXML extractor at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:123 at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87 at gov.nih.niaid.fscanner.Extract.ExtractContents:69 Caused by: org.apache.poi.openxml4j.exceptions.InvalidFormatException: You can't add a part with a part name derived from another part ! [M1.11] at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl:338 at org.apache.poi.openxml4j.opc.OPCPackage.getParts:774 at org.apache.poi.openxml4j.opc.OPCPackage.open:268 at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:69 at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87 Caused by: org.apache.poi.openxml4j.exceptions.InvalidOperationException: You can't add a part with a part name derived from another part ! [M1.11] at org.apache.poi.openxml4j.opc.PackagePartCollection.put:66 at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl:336 at org.apache.poi.openxml4j.opc.OPCPackage.getParts:774 at org.apache.poi.openxml4j.opc.OPCPackage.open:268 at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:69 at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-2203) InvalidOperationException on a valid Word file
[ https://issues.apache.org/jira/browse/TIKA-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seva Alekseyev updated TIKA-2203: - Attachment: OPCCompliance_DerivedPartNameFAIL.docx > InvalidOperationException on a valid Word file > -- > > Key: TIKA-2203 > URL: https://issues.apache.org/jira/browse/TIKA-2203 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 > Environment: Windows 7 x64, JVM 1.8.0_101 >Reporter: Seva Alekseyev > Attachments: OPCCompliance_DerivedPartNameFAIL.docx > > > The attached Word file, which opens in Word, errors out in Tika: > org.apache.tika.exception.TikaException: Error creating OOXML extractor > at > org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:123 > at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87 > at gov.nih.niaid.fscanner.Extract.ExtractContents:69 > Caused by: org.apache.poi.openxml4j.exceptions.InvalidFormatException: You > can't add a part with a part name derived from another part ! [M1.11] > at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl:338 > at org.apache.poi.openxml4j.opc.OPCPackage.getParts:774 > at org.apache.poi.openxml4j.opc.OPCPackage.open:268 > at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:69 > at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87 > Caused by: org.apache.poi.openxml4j.exceptions.InvalidOperationException: You > can't add a part with a part name derived from another part ! [M1.11] > at org.apache.poi.openxml4j.opc.PackagePartCollection.put:66 > at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl:336 > at org.apache.poi.openxml4j.opc.OPCPackage.getParts:774 > at org.apache.poi.openxml4j.opc.OPCPackage.open:268 > at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:69 > at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-2202) StringIndexOutOfBoundsException on a valid Word document
[ https://issues.apache.org/jira/browse/TIKA-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seva Alekseyev updated TIKA-2202: - Attachment: 052306.ITN032AD_Lack_Protocolv0.45_22May06.doc > StringIndexOutOfBoundsException on a valid Word document > > > Key: TIKA-2202 > URL: https://issues.apache.org/jira/browse/TIKA-2202 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 > Environment: Windows 7 x64, JVM 1.8.0_101 >Reporter: Seva Alekseyev > Attachments: 052306.ITN032AD_Lack_Protocolv0.45_22May06.doc > > > The attached document, which opens in Word, errors out in Tika: > java.lang.StringIndexOutOfBoundsException: String index out of range: 0 > at java.lang.String.charAt:-1 > at > org.apache.tika.parser.microsoft.ListManager.convertToNewNumberText:152 > at org.apache.tika.parser.microsoft.ListManager.buildTuple:111 > at org.apache.tika.parser.microsoft.ListManager.getFormattedNumber:86 > at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph:298 > at org.apache.tika.parser.microsoft.WordExtractor.parse:179 > at org.apache.tika.parser.microsoft.OfficeParser.parse:169 > at org.apache.tika.parser.microsoft.OfficeParser.parse:130
[jira] [Created] (TIKA-2202) StringIndexOutOfBoundsException on a valid Word document
Seva Alekseyev created TIKA-2202: Summary: StringIndexOutOfBoundsException on a valid Word document Key: TIKA-2202 URL: https://issues.apache.org/jira/browse/TIKA-2202 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.14 Environment: Windows 7 x64, JVM 1.8.0_101 Reporter: Seva Alekseyev Attachments: 052306.ITN032AD_Lack_Protocolv0.45_22May06.doc The attached document, which opens in Word, errors out in Tika: java.lang.StringIndexOutOfBoundsException: String index out of range: 0 at java.lang.String.charAt:-1 at org.apache.tika.parser.microsoft.ListManager.convertToNewNumberText:152 at org.apache.tika.parser.microsoft.ListManager.buildTuple:111 at org.apache.tika.parser.microsoft.ListManager.getFormattedNumber:86 at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph:298 at org.apache.tika.parser.microsoft.WordExtractor.parse:179 at org.apache.tika.parser.microsoft.OfficeParser.parse:169 at org.apache.tika.parser.microsoft.OfficeParser.parse:130
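An index of 0 being out of range in `String.charAt` means the string was empty when its first character was read. The sketch below shows that JDK behaviour and a defensive guard. The class and method names are hypothetical, and it does not reproduce the logic of Tika's `ListManager.convertToNewNumberText`; it only illustrates the kind of check that avoids this exception.

```java
// Hypothetical illustration: not Tika code.
final class EmptyStringGuardSketch {

    /** Returns true when reading the first character of {@code text} throws. */
    public static boolean unguardedThrows(String text) {
        try {
            text.charAt(0);
            return false;
        } catch (StringIndexOutOfBoundsException e) {
            return true;
        }
    }

    /** Returns the first character, or {@code fallback} when the string is null or empty. */
    public static char firstCharOrDefault(String text, char fallback) {
        return (text == null || text.isEmpty()) ? fallback : text.charAt(0);
    }

    public static void main(String[] args) {
        System.out.println(unguardedThrows(""));
        System.out.println(firstCharOrDefault("", '?'));
        System.out.println(firstCharOrDefault("1.", '?'));
    }
}
```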
[jira] [Updated] (TIKA-2201) OutOfMemoryError on a reasonably sized document
[ https://issues.apache.org/jira/browse/TIKA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seva Alekseyev updated TIKA-2201: - Description: The following document, which is not particularly big, causes an OOM in Tika parser: https://dl.dropboxusercontent.com/u/92341073/Certificates-9-20-2013.pptx Java memory limit is 4GB. was:The attached document, which is not particularly big, causes an OOM in Tika parser. > OutOfMemoryError on a reasonably sized document > --- > > Key: TIKA-2201 > URL: https://issues.apache.org/jira/browse/TIKA-2201 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 > Environment: Windows 7 x64, JVM 1.8.0_101 >Reporter: Seva Alekseyev > > The following document, which is not particularly big, causes an OOM in Tika > parser: > https://dl.dropboxusercontent.com/u/92341073/Certificates-9-20-2013.pptx > Java memory limit is 4GB.
[jira] [Created] (TIKA-2201) OutOfMemoryError on a reasonably sized document
Seva Alekseyev created TIKA-2201: Summary: OutOfMemoryError on a reasonably sized document Key: TIKA-2201 URL: https://issues.apache.org/jira/browse/TIKA-2201 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.14 Environment: Windows 7 x64, JVM 1.8.0_101 Reporter: Seva Alekseyev The attached document, which is not particularly big, causes an OOM in Tika parser.
[jira] [Updated] (TIKA-2197) TikaException from invalid URL in an Excel document
[ https://issues.apache.org/jira/browse/TIKA-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seva Alekseyev updated TIKA-2197: - Attachment: NCI WH presentation JAC 3-23-15_234pm.pptx > TikaException from invalid URL in an Excel document > --- > > Key: TIKA-2197 > URL: https://issues.apache.org/jira/browse/TIKA-2197 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 > Environment: Windows 7 x64, JVM 1.8.0_101 >Reporter: Seva Alekseyev > Attachments: NCI WH presentation JAC 3-23-15_234pm.pptx, > Neut_paratope_updated_0813_naming_formattable.xlsx > > > The attached document, which opens fine in Excel (if slowly), causes the > following error in the Tika parser: > org.apache.tika.exception.TikaException: Error creating OOXML extractor > at > org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:120 > at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87 > Caused by: java.lang.IllegalArgumentException: targetUri invalid - > http://invalid.uri > at org.apache.poi.openxml4j.opc.PackagingURIHelper.resolvePartUri:427 > at org.apache.poi.openxml4j.opc.PackageRelationship.getTargetURI:206 > at > org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.addDrawingHyperLinks:182 > at > org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML:134 > at > org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML:112 > at > org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML:105 > at > org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:112 > at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87 > at gov.nih.niaid.fscanner.Extract.ExtractContents:69
[jira] [Created] (TIKA-2200) XML schema mismatch error on a valid Word document
Seva Alekseyev created TIKA-2200: Summary: XML schema mismatch error on a valid Word document Key: TIKA-2200 URL: https://issues.apache.org/jira/browse/TIKA-2200 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.14 Environment: Windows 7 x64, JVM 1.8.0_101 Reporter: Seva Alekseyev Attachments: MK2048_FROM_ISENTRIS.docx The attached document, which opens in Word, errors out in Tika: org.apache.poi.POIXMLException: org.apache.xmlbeans.XmlException: error: The document is not a document@http://schemas.openxmlformats.org/wordprocessingml/2006/main: document element local name mismatch expected document got wordDocument at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead:241 at org.apache.poi.POIXMLDocument.load:190 at org.apache.poi.xwpf.usermodel.XWPFDocument.:124 at org.apache.poi.xwpf.extractor.XWPFWordExtractor.:58 at org.apache.poi.extractor.ExtractorFactory.createExtractor:232 at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:86 at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87 Caused by: org.apache.xmlbeans.XmlException: error: The document is not a document@http://schemas.openxmlformats.org/wordprocessingml/2006/main: document element local name mismatch expected document got wordDocument at org.apache.xmlbeans.impl.store.Locale.verifyDocumentType:459 at org.apache.xmlbeans.impl.store.Locale.autoTypeDocument:364 at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject:1391 at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject:1370 at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse:370 at org.apache.poi.POIXMLTypeLoader.parse:116 at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse:-1 at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead:164 at org.apache.poi.POIXMLDocument.load:190 at org.apache.poi.xwpf.usermodel.XWPFDocument.:124 at org.apache.poi.xwpf.extractor.XWPFWordExtractor.:58 at 
org.apache.poi.extractor.ExtractorFactory.createExtractor:232 at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:86 at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-2200) XML schema mismatch error on a valid Word document
[ https://issues.apache.org/jira/browse/TIKA-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2200:
-
    Attachment: MK2048_FROM_ISENTRIS.docx

> XML schema mismatch error on a valid Word document
> --
>
> Key: TIKA-2200
> URL: https://issues.apache.org/jira/browse/TIKA-2200
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: MK2048_FROM_ISENTRIS.docx
>
>
> The attached document, which opens in Word, errors out in Tika:
> org.apache.poi.POIXMLException: org.apache.xmlbeans.XmlException: error: The document is not a document@http://schemas.openxmlformats.org/wordprocessingml/2006/main: document element local name mismatch expected document got wordDocument
>     at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead:241
>     at org.apache.poi.POIXMLDocument.load:190
>     at org.apache.poi.xwpf.usermodel.XWPFDocument.:124
>     at org.apache.poi.xwpf.extractor.XWPFWordExtractor.:58
>     at org.apache.poi.extractor.ExtractorFactory.createExtractor:232
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:86
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87
> Caused by: org.apache.xmlbeans.XmlException: error: The document is not a document@http://schemas.openxmlformats.org/wordprocessingml/2006/main: document element local name mismatch expected document got wordDocument
>     at org.apache.xmlbeans.impl.store.Locale.verifyDocumentType:459
>     at org.apache.xmlbeans.impl.store.Locale.autoTypeDocument:364
>     at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject:1391
>     at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject:1370
>     at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse:370
>     at org.apache.poi.POIXMLTypeLoader.parse:116
>     at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse:-1
>     at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead:164
>     at org.apache.poi.POIXMLDocument.load:190
>     at org.apache.poi.xwpf.usermodel.XWPFDocument.:124
>     at org.apache.poi.xwpf.extractor.XWPFWordExtractor.:58
>     at org.apache.poi.extractor.ExtractorFactory.createExtractor:232
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:86
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-2199) RecordFormatException on a valid Excel file
Seva Alekseyev created TIKA-2199:

Summary: RecordFormatException on a valid Excel file
Key: TIKA-2199
URL: https://issues.apache.org/jira/browse/TIKA-2199
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.14
Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
Attachments: CDC survcost.xls

The attached file, which opens in Excel, causes an error in the Tika parser:
org.apache.poi.util.RecordFormatException: Unable to construct record instance
    at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create:98
    at org.apache.poi.hssf.record.RecordFactory.createSingleRecord:345
    at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord:307
    at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord:273
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:175
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
    at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
    at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
    at org.apache.tika.parser.microsoft.OfficeParser.parse:177
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.IllegalArgumentException: Start index must be less than end index.
    at org.apache.poi.hssf.usermodel.HSSFRichTextString.applyFont:136
    at org.apache.poi.hssf.record.TextObjectRecord.processFontRuns:155
    at org.apache.poi.hssf.record.TextObjectRecord.:131
    at sun.reflect.GeneratedConstructorAccessor19.newInstance:-1
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
    at java.lang.reflect.Constructor.newInstance:-1
    at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create:84
    at org.apache.poi.hssf.record.RecordFactory.createSingleRecord:345
    at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord:307
    at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord:273
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:175
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
    at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
    at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
    at org.apache.tika.parser.microsoft.OfficeParser.parse:177
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130
    at gov.nih.niaid.fscanner.Extract.ExtractContents:69
[jira] [Updated] (TIKA-2199) RecordFormatException on a valid Excel file
[ https://issues.apache.org/jira/browse/TIKA-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2199:
-
    Attachment: CDC survcost.xls

> RecordFormatException on a valid Excel file
> ---
>
> Key: TIKA-2199
> URL: https://issues.apache.org/jira/browse/TIKA-2199
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: CDC survcost.xls
>
>
> The attached file, which opens in Excel, causes an error in the Tika parser:
> org.apache.poi.util.RecordFormatException: Unable to construct record instance
>     at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create:98
>     at org.apache.poi.hssf.record.RecordFactory.createSingleRecord:345
>     at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord:307
>     at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord:273
>     at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:175
>     at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
>     at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
>     at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:177
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:130
> Caused by: java.lang.IllegalArgumentException: Start index must be less than end index.
>     at org.apache.poi.hssf.usermodel.HSSFRichTextString.applyFont:136
>     at org.apache.poi.hssf.record.TextObjectRecord.processFontRuns:155
>     at org.apache.poi.hssf.record.TextObjectRecord.:131
>     at sun.reflect.GeneratedConstructorAccessor19.newInstance:-1
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
>     at java.lang.reflect.Constructor.newInstance:-1
>     at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create:84
>     at org.apache.poi.hssf.record.RecordFactory.createSingleRecord:345
>     at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord:307
>     at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord:273
>     at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:175
>     at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
>     at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
>     at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:177
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:130
>     at gov.nih.niaid.fscanner.Extract.ExtractContents:69
[jira] [Updated] (TIKA-2198) NullPointerException on a valid Word file
[ https://issues.apache.org/jira/browse/TIKA-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2198:
-
    Attachment: CIPRA SA concept project 2 rev JM.doc

> NullPointerException on a valid Word file
> -
>
> Key: TIKA-2198
> URL: https://issues.apache.org/jira/browse/TIKA-2198
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: CIPRA SA concept project 2 rev JM.doc
>
>
> On the attached file, which opens fine in Word, the Tika parser throws the following error:
> java.lang.NullPointerException:
>     at org.apache.poi.hwpf.model.ListTables.getLevel:141
>     at org.apache.poi.hwpf.usermodel.Paragraph.newParagraph:125
>     at org.apache.poi.hwpf.usermodel.Range.getParagraph:766
>     at org.apache.tika.parser.microsoft.WordExtractor.parse:178
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:169
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:130
[jira] [Created] (TIKA-2198) NullPointerException on a valid Word file
Seva Alekseyev created TIKA-2198:

Summary: NullPointerException on a valid Word file
Key: TIKA-2198
URL: https://issues.apache.org/jira/browse/TIKA-2198
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.14
Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev

On the attached file, which opens fine in Word, the Tika parser throws the following error:
java.lang.NullPointerException:
    at org.apache.poi.hwpf.model.ListTables.getLevel:141
    at org.apache.poi.hwpf.usermodel.Paragraph.newParagraph:125
    at org.apache.poi.hwpf.usermodel.Range.getParagraph:766
    at org.apache.tika.parser.microsoft.WordExtractor.parse:178
    at org.apache.tika.parser.microsoft.OfficeParser.parse:169
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130
[jira] [Created] (TIKA-2197) TikaException from invalid URL in an Excel document
Seva Alekseyev created TIKA-2197:

Summary: TikaException from invalid URL in an Excel document
Key: TIKA-2197
URL: https://issues.apache.org/jira/browse/TIKA-2197
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.14
Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
Attachments: Neut_paratope_updated_0813_naming_formattable.xlsx

The attached document, which opens fine in Excel (if slowly), causes the following error in the Tika parser:
org.apache.tika.exception.TikaException: Error creating OOXML extractor
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:120
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87
Caused by: java.lang.IllegalArgumentException: targetUri invalid - http://invalid.uri
    at org.apache.poi.openxml4j.opc.PackagingURIHelper.resolvePartUri:427
    at org.apache.poi.openxml4j.opc.PackageRelationship.getTargetURI:206
    at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.addDrawingHyperLinks:182
    at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML:134
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML:112
    at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML:105
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:112
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87
    at gov.nih.niaid.fscanner.Extract.ExtractContents:69
[jira] [Updated] (TIKA-2197) TikaException from invalid URL in an Excel document
[ https://issues.apache.org/jira/browse/TIKA-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2197:
-
    Attachment: Neut_paratope_updated_0813_naming_formattable.xlsx

> TikaException from invalid URL in an Excel document
> ---
>
> Key: TIKA-2197
> URL: https://issues.apache.org/jira/browse/TIKA-2197
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Neut_paratope_updated_0813_naming_formattable.xlsx
>
>
> The attached document, which opens fine in Excel (if slowly), causes the following error in the Tika parser:
> org.apache.tika.exception.TikaException: Error creating OOXML extractor
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:120
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87
> Caused by: java.lang.IllegalArgumentException: targetUri invalid - http://invalid.uri
>     at org.apache.poi.openxml4j.opc.PackagingURIHelper.resolvePartUri:427
>     at org.apache.poi.openxml4j.opc.PackageRelationship.getTargetURI:206
>     at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.addDrawingHyperLinks:182
>     at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML:134
>     at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML:112
>     at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML:105
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:112
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87
>     at gov.nih.niaid.fscanner.Extract.ExtractContents:69
[jira] [Updated] (TIKA-2196) IllegalArgumentException on a valid Excel file
[ https://issues.apache.org/jira/browse/TIKA-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2196:
-
    Attachment: 2007 Experiment watch.xls

> IllegalArgumentException on a valid Excel file
> --
>
> Key: TIKA-2196
> URL: https://issues.apache.org/jira/browse/TIKA-2196
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: 2007 Experiment watch.xls
>
>
> On the attached Excel file, which opens fine in Excel, Tika throws the following error:
> java.lang.IllegalArgumentException: Cannot format given Object as a Number
>     at java.text.DecimalFormat.format:-1
>     at org.apache.poi.ss.usermodel.ExcelGeneralNumberFormat.format:67
>     at java.text.Format.format:-1
>     at org.apache.poi.ss.usermodel.DataFormatter.performDateFormatting:736
>     at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:804
>     at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:785
>     at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell:143
>     at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.formatNumberDateCell:633
>     at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord:405
>     at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord:336
>     at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord:92
>     at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord:109
>     at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:179
>     at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
>     at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
>     at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:177
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:130
[jira] [Created] (TIKA-2196) IllegalArgumentException on a valid Excel file
Seva Alekseyev created TIKA-2196:

Summary: IllegalArgumentException on a valid Excel file
Key: TIKA-2196
URL: https://issues.apache.org/jira/browse/TIKA-2196
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.14
Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
Attachments: 2007 Experiment watch.xls

On the attached Excel file, which opens fine in Excel, Tika throws the following error:
java.lang.IllegalArgumentException: Cannot format given Object as a Number
    at java.text.DecimalFormat.format:-1
    at org.apache.poi.ss.usermodel.ExcelGeneralNumberFormat.format:67
    at java.text.Format.format:-1
    at org.apache.poi.ss.usermodel.DataFormatter.performDateFormatting:736
    at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:804
    at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:785
    at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell:143
    at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.formatNumberDateCell:633
    at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord:405
    at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord:336
    at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord:92
    at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord:109
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:179
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
    at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
    at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
    at org.apache.tika.parser.microsoft.OfficeParser.parse:177
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130
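The innermost frames of this trace are JDK code: java.text.Format.format(Object) delegates to java.text.DecimalFormat, which rejects any argument that is not a Number. The following is a minimal JDK-only sketch of that failure mode. The class name NonNumberFormatDemo, the helper formatNonNumber, and the choice of a Date argument are assumptions made for illustration; the report does not say which object POI actually passed, and this does not reproduce the attached file.

```java
import java.text.DecimalFormat;
import java.text.Format;
import java.util.Date;

// Hypothetical demo class, not part of Tika or POI.
class NonNumberFormatDemo {

    /**
     * Formats the value through the generic Format.format(Object) entry point.
     * Returns the exception message if DecimalFormat rejects the value,
     * or null if formatting succeeds.
     */
    static String formatNonNumber(Object value) {
        Format format = new DecimalFormat("0.##");
        try {
            format.format(value);
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // A Date is not a Number, so DecimalFormat refuses it.
        System.out.println(formatNonNumber(new Date(0L)));
        // A Double is accepted, so no message is returned.
        System.out.println(formatNonNumber(1.5));
    }
}
```

If this is indeed the mechanism, a caller could guard the call with an instanceof Number check, though whether that is the right fix for POI is not established by the report.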
[jira] [Updated] (TIKA-2185) NegativeArraySizeException on a valid Word file
[ https://issues.apache.org/jira/browse/TIKA-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2185:
-
    Attachment: PatentW final.doc

> NegativeArraySizeException on a valid Word file
> ---
>
> Key: TIKA-2185
> URL: https://issues.apache.org/jira/browse/TIKA-2185
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: PatentW final.doc
>
>
> On the attached document, which opens fine with Word, the Tika parser throws the following:
> java.lang.NegativeArraySizeException:
>     at org.apache.poi.hwpf.model.StyleDescription.:122
>     at org.apache.poi.hwpf.model.StyleSheet.:107
>     at org.apache.poi.hwpf.HWPFDocument.:289
>     at org.apache.tika.parser.microsoft.WordExtractor.parse:151
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:169
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:130
[jira] [Created] (TIKA-2185) NegativeArraySizeException on a valid Word file
Seva Alekseyev created TIKA-2185:

Summary: NegativeArraySizeException on a valid Word file
Key: TIKA-2185
URL: https://issues.apache.org/jira/browse/TIKA-2185
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.14
Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
Attachments: PatentW final.doc

On the attached document, which opens fine with Word, the Tika parser throws the following:
java.lang.NegativeArraySizeException:
    at org.apache.poi.hwpf.model.StyleDescription.:122
    at org.apache.poi.hwpf.model.StyleSheet.:107
    at org.apache.poi.hwpf.HWPFDocument.:289
    at org.apache.tika.parser.microsoft.WordExtractor.parse:151
    at org.apache.tika.parser.microsoft.OfficeParser.parse:169
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130
[jira] [Created] (TIKA-2184) RecordFormatException on a valid Excel file
Seva Alekseyev created TIKA-2184:

Summary: RecordFormatException on a valid Excel file
Key: TIKA-2184
URL: https://issues.apache.org/jira/browse/TIKA-2184
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.14
Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
Attachments: HIVT Discrepancy Report- 3-29-04UCSF.xls

On the attached file, which opens fine with Excel, the Tika parser throws the following:
org.apache.poi.hssf.record.RecordFormatException: Unhandled Continue Record followining class org.apache.poi.hssf.record.TabIdRecord
    at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord:379
    at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord:273
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:175
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
    at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
    at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
    at org.apache.tika.parser.microsoft.OfficeParser.parse:177
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130
[jira] [Updated] (TIKA-2184) RecordFormatException on a valid Excel file
[ https://issues.apache.org/jira/browse/TIKA-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2184:
-
    Attachment: HIVT Discrepancy Report- 3-29-04UCSF.xls

> RecordFormatException on a valid Excel file
> ---
>
> Key: TIKA-2184
> URL: https://issues.apache.org/jira/browse/TIKA-2184
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: HIVT Discrepancy Report- 3-29-04UCSF.xls
>
>
> On the attached file, which opens fine with Excel, the Tika parser throws the following:
> org.apache.poi.hssf.record.RecordFormatException: Unhandled Continue Record followining class org.apache.poi.hssf.record.TabIdRecord
>     at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord:379
>     at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord:273
>     at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:175
>     at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
>     at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
>     at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:177
>     at org.apache.tika.parser.microsoft.OfficeParser.parse:130
[jira] [Updated] (TIKA-2153) TaggedIOException on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2153:
-
    Description:
On the following Powerpoint file, which opens fine with Powerpoint:
https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx
the Tika parser throws the following error:
org.apache.tika.io.TaggedIOException: invalid stored block lengths
    at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
    at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
    at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
    at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78)
    ... 13 more
Caused by: java.util.zip.ZipException: invalid stored block lengths
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
    ... 19 more
Could be similar to #2130.
EDIT: similar exception on the attached Jinwoo_032910.pptx
EDIT: similar exception on daids.ppt
EDIT: similar exception on "IAVI Team meeting FINAL.ppt", but see #2162
EDIT: "Marcia Lecture.PPT"

  was:
On the following Powerpoint file, which opens fine with Powerpoint:
https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx
the Tika parser throws the following error:
org.apache.tika.io.TaggedIOException: invalid stored block lengths
    at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
    at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
    at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
    at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
[jira] [Updated] (TIKA-2153) TaggedIOException on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2153:
-
    Attachment: Marcia Lecture.PPT

> TaggedIOException on a valid Powerpoint file
>
>
> Key: TIKA-2153
> URL: https://issues.apache.org/jira/browse/TIKA-2153
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: IAVI Team meeting FINAL.ppt, Jinwoo_032910.pptx, Marcia Lecture.PPT, daids.ppt, tika_2153_unzipping.png
>
>
> On the following Powerpoint file, which opens fine with Powerpoint:
> https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx
> the Tika parser throws the following error:
> org.apache.tika.io.TaggedIOException: invalid stored block lengths
>     at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>     at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
>     at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
>     at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
>     at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
>     at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>     at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
>     at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
>     at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>     at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
>     at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>     at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
>     at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>     at java.io.FilterInputStream.read(FilterInputStream.java:107)
>     at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78)
>     ... 13 more
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>     at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>     at org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>     at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>     ... 19 more
> Could be similar to #2130.
> EDIT: similar exception on the attached Jinwoo_032910.pptx
> EDIT: similar exception on daids.ppt
> EDIT: similar exception on "IAVI Team meeting FINAL.ppt", but see #2162
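The root cause in this trace is a java.util.zip.ZipException raised by InflaterInputStream with the message "invalid stored block lengths". In the DEFLATE format (RFC 1951), a stored block carries a 16-bit length LEN followed by its one's complement NLEN, and the inflater reports this message when the two disagree. The sketch below triggers that condition with a hand-built byte array using only the JDK. The class name StoredBlockLengthDemo, the helper inflateError, and the sample bytes are assumptions made for illustration; it does not reproduce the attached files or say why their embedded stream is read this way.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;
import java.util.zip.ZipException;

// Hypothetical demo class, not part of Tika or POI.
class StoredBlockLengthDemo {

    // Raw deflate stream: header byte 0x01 (final block, type "stored"),
    // LEN = 0x0005, NLEN = 0x0000. NLEN should be 0xFFFA, so the block is corrupt.
    static final byte[] CORRUPT_STORED_BLOCK = {
        0x01, 0x05, 0x00, 0x00, 0x00, 'h', 'e', 'l', 'l', 'o'
    };

    /** Inflates raw deflate data; returns the ZipException message, or null on success. */
    static String inflateError(byte[] rawDeflate) throws IOException {
        Inflater inflater = new Inflater(true); // true = raw deflate, no zlib header
        try (InputStream in =
                 new InflaterInputStream(new ByteArrayInputStream(rawDeflate), inflater)) {
            byte[] buffer = new byte[64];
            while (in.read(buffer) != -1) {
                // Drain the stream; the decompressed bytes are not needed here.
            }
            return null;
        } catch (ZipException e) {
            return e.getMessage();
        } finally {
            inflater.end(); // a caller-supplied Inflater is not released by close()
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(inflateError(CORRUPT_STORED_BLOCK));
    }
}
```

The same message can therefore appear whenever bytes that are not valid deflate data, or are read from the wrong offset, are handed to an inflater, which may be worth checking for the embedded parts in the attached files.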
[jira] [Updated] (TIKA-2153) TaggedIOException on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2153:
-
    Description:
On the following Powerpoint file, which opens fine with Powerpoint:
https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx
the Tika parser throws the following error:
org.apache.tika.io.TaggedIOException: invalid stored block lengths
    at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
    at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
    at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
    at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78)
    ... 13 more
Caused by: java.util.zip.ZipException: invalid stored block lengths
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
    ... 19 more
Could be similar to #2130.
EDIT: similar exception on the attached Jinwoo_032910.pptx
EDIT: similar exception on daids.ppt
EDIT: similar exception on "IAVI Team meeting FINAL.ppt", but see #2162

  was:
On the following Powerpoint file, which opens fine with Powerpoint:
https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx
the Tika parser throws the following error:
org.apache.tika.io.TaggedIOException: invalid stored block lengths
    at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
    at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
    at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
    at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
    at
[jira] [Updated] (TIKA-2153) TaggedIOException on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seva Alekseyev updated TIKA-2153: - Attachment: IAVI Team meeting FINAL.ppt > TaggedIOException on a valid Powerpoint file > > > Key: TIKA-2153 > URL: https://issues.apache.org/jira/browse/TIKA-2153 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.13 > Environment: Windows 7 x64, JVM 1.8.0_101 >Reporter: Seva Alekseyev > Attachments: IAVI Team meeting FINAL.ppt, Jinwoo_032910.pptx, > daids.ppt, tika_2153_unzipping.png > > > On the following Powerpoint file, which opens fine with Powerpoint: > https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx > the Tika parser throws the following error: > org.apache.tika.io.TaggedIOException: invalid stored block lengths > at > org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82) > at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258) > at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471) > at > org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112) > at > org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) > at > org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) > at > org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298) > at > org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199) > at > org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112) > at > org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112) > at > 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87) > Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths > at > org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > at java.io.FilterInputStream.read(FilterInputStream.java:107) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78) > ... 13 more > Caused by: java.util.zip.ZipException: invalid stored block lengths > at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164) > at > org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99) > ... 19 more > Could be similar to #2130. > EDIT: similar exception on the attached Jinwoo_032910.pptx > EDIT: similar exception on daids.ppt > EDIT: similar exception on "IAVI Team meeting FINAL.ppt", but see #2162 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
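As background on the innermost cause in the trace above: "invalid stored block lengths" is the message zlib reports when a stored (uncompressed) DEFLATE block has a LEN field and a one's-complement NLEN field that do not match, and java.util.zip.InflaterInputStream rethrows it as a ZipException. Below is a minimal JDK-only sketch of that mechanism. It does not use Tika or POI and makes no claim about what is actually wrong with the attached files. The class name StoredBlockDemo, the helper inflateError, and the hand-built byte arrays are illustrative assumptions, not part of the report.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;
import java.util.zip.ZipException;

class StoredBlockDemo {

    // Hypothetical helper: inflates raw DEFLATE bytes to the end and returns
    // the ZipException message, or null if the stream inflates cleanly.
    static String inflateError(byte[] rawDeflate) throws IOException {
        Inflater inflater = new Inflater(true); // true = raw DEFLATE, no zlib header
        try (InputStream in = new InflaterInputStream(
                new ByteArrayInputStream(rawDeflate), inflater)) {
            byte[] buf = new byte[64];
            while (in.read(buf) != -1) {
                // discard the output; only the error path matters here
            }
            return null;
        } catch (ZipException e) {
            return e.getMessage();
        } finally {
            inflater.end(); // release native zlib state
        }
    }

    public static void main(String[] args) throws IOException {
        // One stored block: header 0x01 (final block, type "stored"),
        // LEN = 0x0002, NLEN = 0xFFFD (bitwise complement of LEN), 2 data bytes.
        byte[] good = {0x01, 0x02, 0x00, (byte) 0xFD, (byte) 0xFF, 'o', 'k'};

        // Same block, but NLEN is zeroed so it no longer complements LEN.
        byte[] bad = {0x01, 0x02, 0x00, 0x00, 0x00, 'o', 'k'};

        System.out.println("good: " + inflateError(good));
        System.out.println("bad:  " + inflateError(bad));
    }
}
```

On a typical JDK the corrupted array should yield the same message string quoted in the report, which suggests the failing embedded part contains (or is being read as) a DEFLATE stream with an inconsistent stored-block header.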
[jira] [Updated] (TIKA-2153) TaggedIOException on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seva Alekseyev updated TIKA-2153: - Description: On the following Powerpoint file, which opens fine with Powerpoint: https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx the Tika parser throws the following error: org.apache.tika.io.TaggedIOException: invalid stored block lengths at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82) at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258) at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471) at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112) at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298) at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199) at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112) at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112) at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87) Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at 
java.io.BufferedInputStream.read1(BufferedInputStream.java:286) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at java.io.FilterInputStream.read(FilterInputStream.java:107) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78) ... 13 more Caused by: java.util.zip.ZipException: invalid stored block lengths at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164) at org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213) at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99) ... 19 more Could be similar to #2130. EDIT: similar exception on the attached Jinwoo_032910.pptx EDIT: similar exception on daids.pptx was: On the following Powerpoint file, which opens fine with Powerpoint: https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx the Tika parses throws the following error: org.apache.tika.io.TaggedIOException: invalid stored block lengths at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82) at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258) at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471) at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112) at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298) at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199) at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112) at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112) at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87) Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99) at
[jira] [Updated] (TIKA-2153) TaggedIOException on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seva Alekseyev updated TIKA-2153: - Attachment: daids.ppt > TaggedIOException on a valid Powerpoint file > > > Key: TIKA-2153 > URL: https://issues.apache.org/jira/browse/TIKA-2153 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.13 > Environment: Windows 7 x64, JVM 1.8.0_101 >Reporter: Seva Alekseyev > Attachments: Jinwoo_032910.pptx, daids.ppt, tika_2153_unzipping.png > > > On the following Powerpoint file, which opens fine with Powerpoint: > https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx > the Tika parser throws the following error: > org.apache.tika.io.TaggedIOException: invalid stored block lengths > at > org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82) > at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258) > at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471) > at > org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112) > at > org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) > at > org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) > at > org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298) > at > org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199) > at > org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112) > at > org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112) > at > org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87) > 
Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths > at > org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > at java.io.FilterInputStream.read(FilterInputStream.java:107) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78) > ... 13 more > Caused by: java.util.zip.ZipException: invalid stored block lengths > at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164) > at > org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99) > ... 19 more > Could be similar to #2130. > EDIT: similar exception on the attached Jinwoo_032910.pptx > EDIT: similar exception on daids.pptx
[jira] [Updated] (TIKA-2153) TaggedIOException on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seva Alekseyev updated TIKA-2153: - Description: On the following Powerpoint file, which opens fine with Powerpoint: https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx the Tika parser throws the following error: org.apache.tika.io.TaggedIOException: invalid stored block lengths at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82) at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258) at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471) at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112) at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298) at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199) at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112) at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112) at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87) Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at 
java.io.BufferedInputStream.read1(BufferedInputStream.java:286) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at java.io.FilterInputStream.read(FilterInputStream.java:107) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78) ... 13 more Caused by: java.util.zip.ZipException: invalid stored block lengths at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164) at org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213) at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99) ... 19 more Could be similar to #2130. EDIT: similar exception on the attached Jinwoo_032910.pptx EDIT: similar exception on daids.ppt was: On the following Powerpoint file, which opens fine with Powerpoint: https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx the Tika parses throws the following error: org.apache.tika.io.TaggedIOException: invalid stored block lengths at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82) at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258) at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471) at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112) at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298) at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199) at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112) at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112) at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87) Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99) at
[jira] [Updated] (TIKA-2153) TaggedIOException on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seva Alekseyev updated TIKA-2153: - Attachment: Jinwoo_032910.pptx > TaggedIOException on a valid Powerpoint file > > > Key: TIKA-2153 > URL: https://issues.apache.org/jira/browse/TIKA-2153 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.13 > Environment: Windows 7 x64, JVM 1.8.0_101 >Reporter: Seva Alekseyev > Attachments: Jinwoo_032910.pptx, tika_2153_unzipping.png > > > On the following Powerpoint file, which opens fine with Powerpoint: > https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx > the Tika parser throws the following error: > org.apache.tika.io.TaggedIOException: invalid stored block lengths > at > org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82) > at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258) > at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471) > at > org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112) > at > org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) > at > org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) > at > org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298) > at > org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199) > at > org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112) > at > org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112) > at > org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87) > 
Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths > at > org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > at java.io.FilterInputStream.read(FilterInputStream.java:107) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78) > ... 13 more > Caused by: java.util.zip.ZipException: invalid stored block lengths > at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164) > at > org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99) > ... 19 more > Could be similar to #2130. > EDIT: similar exception on the attached Jinwoo_032910.pptx
[jira] [Updated] (TIKA-2153) TaggedIOException on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seva Alekseyev updated TIKA-2153: - Description: On the following Powerpoint file, which opens fine with Powerpoint: https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx the Tika parser throws the following error: org.apache.tika.io.TaggedIOException: invalid stored block lengths at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82) at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258) at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471) at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112) at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298) at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199) at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112) at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112) at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87) Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at 
java.io.BufferedInputStream.read1(BufferedInputStream.java:286) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at java.io.FilterInputStream.read(FilterInputStream.java:107) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78) ... 13 more Caused by: java.util.zip.ZipException: invalid stored block lengths at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164) at org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213) at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99) ... 19 more Could be similar to #2130. EDIT: similar exception on the attached Jinwoo_032910.pptx was: On the following Powerpoint file, which opens fine with Powerpoint: https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx the Tika parses throws the following error: org.apache.tika.io.TaggedIOException: invalid stored block lengths at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82) at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258) at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471) at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112) at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298) at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199) at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112) at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112) at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87) Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99) at
[jira] [Updated] (TIKA-2161) EOFException on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seva Alekseyev updated TIKA-2161: - Description: On the attached Powerpoint file, which opens fine with Powerpoint, the Tika parser throws the following error: org.apache.tika.io.TaggedIOException: Unexpected end of ZLIB input stream at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99) at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at java.io.FilterInputStream.read(FilterInputStream.java:107) at java.nio.file.Files.copy(Files.java:2908) at java.nio.file.Files.copy(Files.java:3027) at org.apache.tika.io.TikaInputStream.getPath(TikaInputStream.java:587) at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:615) at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:377) at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:443) at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112) at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:140) at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:116) at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources(HSLFExtractor.java:368) at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:138) at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117) Caused by: java.io.EOFException: Unexpected end of ZLIB input stream at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240) at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158) at org.apache.poi.util.BoundedInputStream.read(BoundedInputStream.java:121) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99) ... 22 more EDIT: Tika 1.14 throws EOFException was: On the attached Powerpoint file, which opens fine with Powerpoint, the Tika parser throws the following error: org.apache.tika.io.TaggedIOException: Unexpected end of ZLIB input stream at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103) at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99) at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at java.io.FilterInputStream.read(FilterInputStream.java:107) at java.nio.file.Files.copy(Files.java:2908) at java.nio.file.Files.copy(Files.java:3027) at org.apache.tika.io.TikaInputStream.getPath(TikaInputStream.java:587) at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:615) at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:377) at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:443) at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112) at 
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:140) at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:116) at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources(HSLFExtractor.java:368) at
[jira] [Updated] (TIKA-2161) EOFException on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seva Alekseyev updated TIKA-2161: - Summary: EOFException on a valid Powerpoint file (was: TaggedIOException from EOFException on a valid Powerpoint file) > EOFException on a valid Powerpoint file > --- > > Key: TIKA-2161 > URL: https://issues.apache.org/jira/browse/TIKA-2161 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.13 > Environment: Windows 7 x64, JVM 1.8.0_101 >Reporter: Seva Alekseyev > Attachments: Erik-LymeChipBranchSeminar.ppt > > > On the attached Powerpoint file, which opens fine with Powerpoint, the Tika > parser throws the following error: > org.apache.tika.io.TaggedIOException: Unexpected end of ZLIB input stream > at > org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > at java.io.FilterInputStream.read(FilterInputStream.java:107) > at java.nio.file.Files.copy(Files.java:2908) > at java.nio.file.Files.copy(Files.java:3027) > at org.apache.tika.io.TikaInputStream.getPath(TikaInputStream.java:587) > at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:615) > at > org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:377) > at > org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:443) > at > org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112) > at > org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) > at > 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) > at > org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:140) > at > org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:116) > at > org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources(HSLFExtractor.java:368) > at > org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:138) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117) > Caused by: java.io.EOFException: Unexpected end of ZLIB input stream > at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240) > at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158) > at > org.apache.poi.util.BoundedInputStream.read(BoundedInputStream.java:121) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99) > ... 22 more
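The root cause in the TIKA-2161 trace above, "Unexpected end of ZLIB input stream", is raised by java.util.zip.InflaterInputStream when the underlying stream runs out of bytes before the inflater has seen the end of the compressed data. The sketch below reproduces that condition with the JDK alone by truncating a valid zlib stream. It is only an illustration of the exception's origin, not a diagnosis of Erik-LymeChipBranchSeminar.ppt; the class name TruncatedZlibDemo, the helpers compress and inflateError, and the sample payload are assumptions introduced here.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.InflaterInputStream;

class TruncatedZlibDemo {

    // Hypothetical helper: compresses a small payload into a single zlib stream.
    static byte[] compress(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        byte[] buf = new byte[data.length + 64]; // ample room for a small payload
        int n = deflater.deflate(buf);
        deflater.end();
        return Arrays.copyOf(buf, n);
    }

    // Hypothetical helper: inflates to the end and returns the EOFException
    // message, or null if the whole stream was read without error.
    static String inflateError(byte[] zlibBytes) throws IOException {
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(zlibBytes))) {
            byte[] buf = new byte[64];
            while (in.read(buf) != -1) {
                // discard the output; only the error path matters here
            }
            return null;
        } catch (EOFException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] full = compress("embedded object payload".getBytes(StandardCharsets.UTF_8));
        byte[] cut = Arrays.copyOf(full, full.length / 2); // drop the second half

        System.out.println("full: " + inflateError(full));
        System.out.println("cut:  " + inflateError(cut));
    }
}
```

If the embedded resource in the report is length-limited before it reaches the inflater (the trace shows org.apache.poi.util.BoundedInputStream in the chain), a bound shorter than the compressed data would produce the same symptom. That is a hypothesis only and would need to be checked against the file.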
[jira] [Updated] (TIKA-2166) TaggedIOException from a ZipException on a valid Word file
[ https://issues.apache.org/jira/browse/TIKA-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2166:
---------------------------------
    Attachment: AMSMIC briefing doc.docx
[jira] [Created] (TIKA-2166) TaggedIOException from a ZipException on a valid Word file
Seva Alekseyev created TIKA-2166:

Summary: TaggedIOException from a ZipException on a valid Word file
Key: TIKA-2166
URL: https://issues.apache.org/jira/browse/TIKA-2166
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.13
Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


On the attached file, which opens with Word, Tika throws:
org.apache.tika.io.TaggedIOException: invalid block type
    at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:63)
    at org.gagravarr.tika.OggDetector.detect(OggDetector.java:68)
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
    at gov.nih.niaid.fscanner.Extract.ExtractContents(Extract.java:63)
    at gov.nih.niaid.temp.Main.main(Main.java:68)
Caused by: org.apache.tika.io.TaggedIOException: invalid block type
    at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:59)
    ... 12 more
Caused by: java.util.zip.ZipException: invalid block type
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
    ... 16 more
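As background on the root cause named in this trace: InflaterInputStream converts a malformed deflate stream into a java.util.zip.ZipException. The sketch below feeds the inflater three hand-picked bytes, a valid zlib header followed by a block header that uses the reserved block type, to show that conversion in isolation. It says nothing about what is actually wrong with the attached .docx; CorruptDeflateDemo and its members are hypothetical names, not Tika or POI code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.InflaterInputStream;
import java.util.zip.ZipException;

// Hypothetical demo class; nothing here comes from Tika or POI.
class CorruptDeflateDemo {

    // 0x78 0x9C is a valid zlib header. In the next byte, bit 0 is BFINAL and
    // bits 1-2 are BTYPE; 0x07 sets BTYPE to 3, which deflate reserves as invalid.
    static final byte[] RESERVED_BLOCK_TYPE = {0x78, (byte) 0x9C, 0x07};

    /** Inflates the data to the end and reports "OK" or the exception that stopped it. */
    static String inflateOutcome(byte[] data) {
        byte[] buffer = new byte[256];
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(data))) {
            while (in.read(buffer) != -1) {
                // Output is discarded; only the outcome matters here.
            }
            return "OK";
        } catch (ZipException e) {
            return "ZipException: " + e.getMessage();
        } catch (IOException e) {
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(inflateOutcome(RESERVED_BLOCK_TYPE));
    }
}
```

An error of this kind usually means the bytes handed to the inflater are not a deflate stream at that offset, for example data that is stored uncompressed or read from the wrong position. That is a general observation about the exception, not a finding about this file.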
[jira] [Created] (TIKA-2165) NegativeArraySizeException on a valid Word file
Seva Alekseyev created TIKA-2165:

Summary: NegativeArraySizeException on a valid Word file
Key: TIKA-2165
URL: https://issues.apache.org/jira/browse/TIKA-2165
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.13
Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


On the attached file, which opens with Word, the Tika parser throws an error:
java.lang.NegativeArraySizeException
    at org.apache.poi.hwpf.model.Ffn.<init>(Ffn.java:79)
    at org.apache.poi.hwpf.model.FontTable.<init>(FontTable.java:66)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:344)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2164:
---------------------------------
    Description:
On the following Powerpoint file:
https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
which opens fine with Powerpoint, the Tika parser throws the following error:
org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: invalid stored block lengths
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
    ... 6 more
EDIT: the attached "Research forum" file emits a similar error "invalid block type".
EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar "invalid literal/length code" error.
EDIT3: the attached "paperfigures" file emits "invalid distance too far back". Something is wrong with ZIP in Powerpoints.
EDIT4: in "Lab meeting", it's "Unexpected end of ZLIB input stream"
"suba" exhibits a similar error, "invalid distance too far back" but in a different exception.

  was:
On the following Powerpoint file:
https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
which opens fine with Powerpoint, the Tika parser throws the following error:
org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: invalid stored block lengths
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
    ... 6 more
EDIT: the attached "Research forum" file emits a similar error "invalid block type".
EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar "invalid literal/length code" error.
EDIT3: the attached "paperfigures" file emits "invalid distance too far back". Something is wrong with ZIP in Powerpoints.
EDIT4: in "Lab meeting", it's "Unexpected end of ZLIB input stream"

> HSLFException from ZipException "invalid stored block lengths" on a valid
> Powerpoint file
> -------------------------------------------------------------------------
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Jankovic final Retreat 2002.PPT, Lab Meeting.ppt,
> Research Forum 2013.3.ppt, paperfigures.ppt, suba.ppt
>
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException:
> invalid stored block lengths
>     at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>     at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>     at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>     at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>     at java.io.FilterInputStream.read(FilterInputStream.java:107)
>     at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>     ... 6 more
> EDIT: the attached "Research forum" file emits a similar error "invalid block
> type".
> EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar
> "invalid literal/length code" error.
[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2164:
---------------------------------
    Attachment: suba.ppt
[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2164:
---------------------------------
    Attachment: Lab Meeting.ppt
[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2164:
---------------------------------
    Description:
On the following Powerpoint file:
https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
which opens fine with Powerpoint, the Tika parser throws the following error:
org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: invalid stored block lengths
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
    ... 6 more
EDIT: the attached "Research forum" file emits a similar error "invalid block type".
EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar "invalid literal/length code" error.
EDIT3: the attached "paperfigures" file emits "invalid distance too far back". Something is wrong with ZIP in Powerpoints.
EDIT4: in "Lab meeting", it's "Unexpected end of ZLIB input stream"

  was:
On the following Powerpoint file:
https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
which opens fine with Powerpoint, the Tika parser throws the following error:
org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: invalid stored block lengths
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
    ... 6 more
EDIT: the attached "Research forum" file emits a similar error "invalid block type".
EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar "invalid literal/length code" error.
EDIT3: the attached "paperfigures" file emits "invalid distance too far back". Something is wrong with ZIP in Powerpoints.
[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2164:
---------------------------------
    Attachment: paperfigures.ppt
[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2164:
---------------------------------
    Description:
On the following Powerpoint file:
https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
which opens fine with Powerpoint, the Tika parser throws the following error:
org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: invalid stored block lengths
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
    ... 6 more
EDIT: the attached "Research forum" file emits a similar error "invalid block type".
EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar "invalid literal/length code" error.
EDIT3: the attached "paperfigures" file emits "invalid distance too far back". Something is wrong with ZIP in Powerpoints.

  was:
On the following Powerpoint file:
https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
which opens fine with Powerpoint, the Tika parser throws the following error:
org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: invalid stored block lengths
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
    ... 6 more
EDIT: the attached "Research forum" file emits a similar error "invalid block type".
EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar "invalid literal/length code" error.
[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2164:
---------------------------------
    Description:
On the following Powerpoint file:
https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
which opens fine with Powerpoint, the Tika parser throws the following error:
org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: invalid stored block lengths
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
    ... 6 more
EDIT: the attached "Research forum" file emits a similar error "invalid block type".
EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar "invalid literal/length code" error.

  was:
On the following Powerpoint file:
https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
which opens fine with Powerpoint, the Tika parser throws the following error:
org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: invalid stored block lengths
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
    ... 6 more
EDIT: the attached file emits a similar error "invalid block type".
[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2164:
---------------------------------
    Attachment: Jankovic final Retreat 2002.PPT
[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2164:
    Description:
On the following Powerpoint file:
https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
which opens fine with Powerpoint, the Tika parser throws the following error:
org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: invalid stored block lengths
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
    ... 6 more
EDIT: the attached file emits a similar error "invalid block type".

  was:
On the following Powerpoint file:
https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
which opens fine with Powerpoint, the Tika parser throws the following error:
org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: invalid stored block lengths
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
    ... 6 more
EDIT: the other file emits a similar error "invalid block type".

> HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Research Forum 2013.3.ppt
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: invalid stored block lengths
>     at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>     at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>     at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>     at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>     at java.io.FilterInputStream.read(FilterInputStream.java:107)
>     at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>     ... 6 more
> EDIT: the attached file emits a similar error "invalid block type".
[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2164:
    Description:
On the following Powerpoint file:
https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
which opens fine with Powerpoint, the Tika parser throws the following error:
org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: invalid stored block lengths
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
    ... 6 more
EDIT: the other file emits a similar error "invalid block type".

  was:
On the following Powerpoint file:
https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
which opens fine with Powerpoint, the Tika parser throws the following error:
org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: invalid stored block lengths
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
    ... 6 more

> HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Research Forum 2013.3.ppt
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: invalid stored block lengths
>     at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>     at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>     at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>     at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>     at java.io.FilterInputStream.read(FilterInputStream.java:107)
>     at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>     ... 6 more
> EDIT: the other file emits a similar error "invalid block type".
[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2164:
    Attachment: Research Forum 2013.3.ppt

> HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Research Forum 2013.3.ppt
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: invalid stored block lengths
>     at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>     at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>     at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>     at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>     at java.io.FilterInputStream.read(FilterInputStream.java:107)
>     at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>     ... 6 more
[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2164:
    Description:
On the following Powerpoint file:
https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
which opens fine with Powerpoint, the Tika parser throws the following error:
org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: invalid stored block lengths
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
    ... 6 more

  was:
On the attached Powerpoint file, which opens fine with Powerpoint, the Tika parser throws the following error:
org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: invalid stored block lengths
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
    ... 6 more

> HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: invalid stored block lengths
>     at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>     at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>     at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>     at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>     at java.io.FilterInputStream.read(FilterInputStream.java:107)
>     at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>     ... 6 more
[jira] [Created] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file
Seva Alekseyev created TIKA-2164:

Summary: HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file
Key: TIKA-2164
URL: https://issues.apache.org/jira/browse/TIKA-2164
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.13
Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev

On the attached Powerpoint file, which opens fine with Powerpoint, the Tika parser throws the following error:
org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: invalid stored block lengths
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
    ... 6 more
[jira] [Updated] (TIKA-2163) POIXMLException from ClassCastException on a valid Word template
[ https://issues.apache.org/jira/browse/TIKA-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2163:
    Attachment: ChronologicalResume.dotx

> POIXMLException from ClassCastException on a valid Word template
>
> Key: TIKA-2163
> URL: https://issues.apache.org/jira/browse/TIKA-2163
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: ChronologicalResume.dotx
>
> On the attached Word template, which opens fine with Word, the Tika parser throws the following error:
> org.apache.poi.POIXMLException: java.lang.reflect.InvocationTargetException
>     at org.apache.poi.POIXMLFactory.createDocumentPart(POIXMLFactory.java:65)
>     at org.apache.poi.POIXMLDocumentPart.read(POIXMLDocumentPart.java:601)
>     at org.apache.poi.POIXMLDocumentPart.read(POIXMLDocumentPart.java:613)
>     at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:156)
>     at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
>     at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>     at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> Caused by: java.lang.reflect.InvocationTargetException
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>     at org.apache.poi.xwpf.usermodel.XWPFFactory.createDocumentPart(XWPFFactory.java:57)
>     at org.apache.poi.POIXMLFactory.createDocumentPart(POIXMLFactory.java:60)
>     ... 10 more
> Caused by: java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument
>     at org.apache.poi.xwpf.usermodel.XWPFHeaderFooter.<init>(XWPFHeaderFooter.java:74)
>     at org.apache.poi.xwpf.usermodel.XWPFHeader.<init>(XWPFHeader.java:54)
>     ... 16 more
[jira] [Created] (TIKA-2163) POIXMLException from ClassCastException on a valid Word template
Seva Alekseyev created TIKA-2163:

Summary: POIXMLException from ClassCastException on a valid Word template
Key: TIKA-2163
URL: https://issues.apache.org/jira/browse/TIKA-2163
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.13
Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
Attachments: ChronologicalResume.dotx

On the attached Word template, which opens fine with Word, the Tika parser throws the following error:
org.apache.poi.POIXMLException: java.lang.reflect.InvocationTargetException
    at org.apache.poi.POIXMLFactory.createDocumentPart(POIXMLFactory.java:65)
    at org.apache.poi.POIXMLDocumentPart.read(POIXMLDocumentPart.java:601)
    at org.apache.poi.POIXMLDocumentPart.read(POIXMLDocumentPart.java:613)
    at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:156)
    at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
    at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at org.apache.poi.xwpf.usermodel.XWPFFactory.createDocumentPart(XWPFFactory.java:57)
    at org.apache.poi.POIXMLFactory.createDocumentPart(POIXMLFactory.java:60)
    ... 10 more
Caused by: java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument
    at org.apache.poi.xwpf.usermodel.XWPFHeaderFooter.<init>(XWPFHeaderFooter.java:74)
    at org.apache.poi.xwpf.usermodel.XWPFHeader.<init>(XWPFHeader.java:54)
    ... 16 more
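The TIKA-2163 trace is three exceptions deep: a POIXMLException wraps an InvocationTargetException, which wraps the ClassCastException that actually describes the fault. When triaging reports like this one, it can help to log the innermost cause first. The following is a minimal sketch using only the JDK. The class name RootCauseDemo and the helper rootCause are hypothetical (they are not part of Tika or POI), and the chain is simulated with plain JDK exception types rather than produced by parsing the attached template.

```java
import java.lang.reflect.InvocationTargetException;

public class RootCauseDemo {

    /** Walks the getCause() chain and returns the innermost throwable. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        // Stop when there is no further cause (or a throwable names itself as cause).
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Simulated chain shaped like the report: wrapper -> reflection wrapper -> real fault.
        // RuntimeException stands in for POIXMLException here (an assumption for illustration).
        Throwable simulated = new RuntimeException(
                new InvocationTargetException(
                        new ClassCastException("simulated cast failure")));

        Throwable root = rootCause(simulated);
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
    }
}
```

In a real caller the argument would be whatever exception the parse call threw; the helper itself does not depend on the exception types involved.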
[jira] [Created] (TIKA-2162) "Unknown compression method" on a Powerpoint file
Seva Alekseyev created TIKA-2162:

Summary: "Unknown compression method" on a Powerpoint file
Key: TIKA-2162
URL: https://issues.apache.org/jira/browse/TIKA-2162
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.13
Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
Attachments: DECAY.ppt

On the attached Powerpoint file, which opens fine with Powerpoint, the Tika parser throws the following error:
org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: unknown compression method
    at org.apache.poi.hslf.blip.EMF.getData(EMF.java:91)
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: unknown compression method
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.poi.hslf.blip.EMF.getData(EMF.java:85)
    ... 6 more
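For readers unfamiliar with where a ZipException of this kind originates: java.util.zip.InflaterInputStream raises it when the bytes it is handed do not form a valid zlib stream. The sketch below uses only the JDK and feeds the inflater a hand-built two-byte zlib header whose checksum is valid but whose compression-method nibble is 9 instead of 8 (deflate). It is an illustration of the failure class only, not a diagnosis of DECAY.ppt; the class name, the helper inflateFailure, and the byte values are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.InflaterInputStream;
import java.util.zip.ZipException;

public class BadZlibHeaderDemo {

    /**
     * Drains a zlib stream. Returns the ZipException message if inflation
     * fails with a format error, or null if the stream inflates cleanly.
     */
    static String inflateFailure(byte[] zlibBytes) {
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(zlibBytes))) {
            byte[] buf = new byte[4096];
            while (in.read(buf) != -1) {
                // Discard the decompressed bytes; only the failure mode matters here.
            }
            return null;
        } catch (ZipException e) {
            return String.valueOf(e.getMessage());
        } catch (IOException e) {
            // Any other I/O problem is unexpected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 0x79 0x18 passes the zlib header checksum (0x7918 is a multiple of 31),
        // but the low nibble of the first byte is 9, which is not the deflate method (8).
        byte[] badHeader = {0x79, 0x18, 0, 0, 0, 0, 0, 0};
        System.out.println("inflate failed with: " + inflateFailure(badHeader));
    }
}
```

The message text comes from the underlying zlib library, so the example prints it rather than hard-coding an expectation.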
[jira] [Updated] (TIKA-2162) "Unknown compression method" on a Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2162:
    Attachment: DECAY.ppt

> "Unknown compression method" on a Powerpoint file
>
> Key: TIKA-2162
> URL: https://issues.apache.org/jira/browse/TIKA-2162
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: DECAY.ppt
>
> On the attached Powerpoint file, which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: unknown compression method
>     at org.apache.poi.hslf.blip.EMF.getData(EMF.java:91)
>     at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>     at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: unknown compression method
>     at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>     at java.io.FilterInputStream.read(FilterInputStream.java:107)
>     at org.apache.poi.hslf.blip.EMF.getData(EMF.java:85)
>     ... 6 more
[jira] [Updated] (TIKA-2161) TaggedIOException from EOFException on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2161:
    Attachment: Erik-LymeChipBranchSeminar.ppt

> TaggedIOException from EOFException on a valid Powerpoint file
>
> Key: TIKA-2161
> URL: https://issues.apache.org/jira/browse/TIKA-2161
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Erik-LymeChipBranchSeminar.ppt
>
> On the attached Powerpoint file, which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.tika.io.TaggedIOException: Unexpected end of ZLIB input stream
>     at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>     at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
>     at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>     at java.io.FilterInputStream.read(FilterInputStream.java:107)
>     at java.nio.file.Files.copy(Files.java:2908)
>     at java.nio.file.Files.copy(Files.java:3027)
>     at org.apache.tika.io.TikaInputStream.getPath(TikaInputStream.java:587)
>     at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:615)
>     at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:377)
>     at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:443)
>     at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
>     at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>     at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
>     at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:140)
>     at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:116)
>     at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources(HSLFExtractor.java:368)
>     at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:138)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.io.EOFException: Unexpected end of ZLIB input stream
>     at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
>     at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
>     at org.apache.poi.util.BoundedInputStream.read(BoundedInputStream.java:121)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>     at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>     ... 22 more
[jira] [Created] (TIKA-2161) TaggedIOException from EOFException on a valid Powerpoint file
Seva Alekseyev created TIKA-2161:

Summary: TaggedIOException from EOFException on a valid Powerpoint file
Key: TIKA-2161
URL: https://issues.apache.org/jira/browse/TIKA-2161
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.13
Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev

On the attached Powerpoint file, which opens fine with Powerpoint, the Tika parser throws the following error:
org.apache.tika.io.TaggedIOException: Unexpected end of ZLIB input stream
    at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at java.nio.file.Files.copy(Files.java:2908)
    at java.nio.file.Files.copy(Files.java:3027)
    at org.apache.tika.io.TikaInputStream.getPath(TikaInputStream.java:587)
    at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:615)
    at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:377)
    at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:443)
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:140)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:116)
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources(HSLFExtractor.java:368)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:138)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.io.EOFException: Unexpected end of ZLIB input stream
    at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
    at org.apache.poi.util.BoundedInputStream.read(BoundedInputStream.java:121)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
    ... 22 more
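The root cause in the TIKA-2161 trace is an EOFException raised from InflaterInputStream.fill, which is what the JDK throws when a zlib stream ends before the inflater has seen all the data it expects. The sketch below reproduces that class of failure with JDK classes only, by compressing a small buffer and cutting the result in half. It says nothing about why the embedded stream in the attached file appears short (a wrong length bound in the reader would look the same as truly truncated data); the class name, helper names, and sample payload are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class TruncatedZlibDemo {

    /** Builds a small, repetitive payload to compress. */
    static byte[] sampleBytes() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append("embedded object payload ").append(i).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Compresses the input into a complete zlib stream. */
    static byte[] deflate(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(bos)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    /** Returns true if draining the zlib stream ends in an EOFException. */
    static boolean endsInEof(byte[] zlibBytes) {
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(zlibBytes))) {
            byte[] buf = new byte[4096];
            while (in.read(buf) != -1) {
                // Discard output; we only care how the stream ends.
            }
            return false;
        } catch (EOFException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] full = deflate(sampleBytes());
        byte[] truncated = Arrays.copyOf(full, full.length / 2);
        System.out.println("complete stream hits EOF:  " + endsInEof(full));
        System.out.println("truncated stream hits EOF: " + endsInEof(truncated));
    }
}
```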
[jira] [Updated] (TIKA-2160) POIXMLException from NullPointerException on a valid Word file
[ https://issues.apache.org/jira/browse/TIKA-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2160:
    Attachment: test_16022016081053.docx

> POIXMLException from NullPointerException on a valid Word file
>
> Key: TIKA-2160
> URL: https://issues.apache.org/jira/browse/TIKA-2160
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: test_16022016081053.docx
>
> On the attached word file, which opens fine with Word (albeit with no text), the Tika parser throws the following error:
> org.apache.poi.POIXMLException: java.lang.NullPointerException
>     at org.apache.poi.xwpf.usermodel.XWPFFooter.onDocumentRead(XWPFFooter.java:130)
>     at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:208)
>     at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
>     at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
>     at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>     at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> Caused by: java.lang.NullPointerException
>     at org.apache.poi.xwpf.usermodel.AbstractXWPFSDT.<init>(AbstractXWPFSDT.java:37)
>     at org.apache.poi.xwpf.usermodel.XWPFSDT.<init>(XWPFSDT.java:38)
>     at org.apache.poi.xwpf.usermodel.XWPFFooter.onDocumentRead(XWPFFooter.java:124)
>     ... 9 more
[jira] [Created] (TIKA-2160) POIXMLException from NullPointerException on a valid Word file
Seva Alekseyev created TIKA-2160:

Summary: POIXMLException from NullPointerException on a valid Word file
Key: TIKA-2160
URL: https://issues.apache.org/jira/browse/TIKA-2160
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.13
Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev

On the attached word file, which opens fine with Word (albeit with no text), the Tika parser throws the following error:
org.apache.poi.POIXMLException: java.lang.NullPointerException
    at org.apache.poi.xwpf.usermodel.XWPFFooter.onDocumentRead(XWPFFooter.java:130)
    at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:208)
    at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
    at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
    at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: java.lang.NullPointerException
    at org.apache.poi.xwpf.usermodel.AbstractXWPFSDT.<init>(AbstractXWPFSDT.java:37)
    at org.apache.poi.xwpf.usermodel.XWPFSDT.<init>(XWPFSDT.java:38)
    at org.apache.poi.xwpf.usermodel.XWPFFooter.onDocumentRead(XWPFFooter.java:124)
    ... 9 more
[jira] [Updated] (TIKA-2158) NullPointerException on a valid Word file
[ https://issues.apache.org/jira/browse/TIKA-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2158:
    Attachment: RTOP_Template01112015063856.docx

> NullPointerException on a valid Word file
>
> Key: TIKA-2158
> URL: https://issues.apache.org/jira/browse/TIKA-2158
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: RTOP_Template01112015063856.docx
>
> On the attached Word file, which opens fine with Word, the Tika parser throws the following error:
> java.lang.NullPointerException
>     at org.apache.poi.xwpf.usermodel.XWPFSDTContentCell.<init>(XWPFSDTContentCell.java:49)
>     at org.apache.poi.xwpf.usermodel.XWPFSDTCell.<init>(XWPFSDTCell.java:35)
>     at org.apache.poi.xwpf.usermodel.XWPFTableRow.getTableICells(XWPFTableRow.java:147)
>     at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractTable(XWPFWordExtractorDecorator.java:359)
>     at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:111)
>     at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:93)
>     at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
[jira] [Created] (TIKA-2158) NullPointerException on a valid Word file
Seva Alekseyev created TIKA-2158:

    Summary: NullPointerException on a valid Word file
    Key: TIKA-2158
    URL: https://issues.apache.org/jira/browse/TIKA-2158
    Project: Tika
    Issue Type: Bug
    Components: parser
    Affects Versions: 1.13
    Environment: Windows 7 x64, JVM 1.8.0_101
    Reporter: Seva Alekseyev
    Attachments: RTOP_Template01112015063856.docx

On the attached Word file, which opens fine with Word, the Tika parser throws the following error:

java.lang.NullPointerException
    at org.apache.poi.xwpf.usermodel.XWPFSDTContentCell.<init>(XWPFSDTContentCell.java:49)
    at org.apache.poi.xwpf.usermodel.XWPFSDTCell.<init>(XWPFSDTCell.java:35)
    at org.apache.poi.xwpf.usermodel.XWPFTableRow.getTableICells(XWPFTableRow.java:147)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractTable(XWPFWordExtractorDecorator.java:359)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:111)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:93)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
[jira] [Updated] (TIKA-2157) HSLFException on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2157:
    Attachment: CRADA 2-09 K Subbarao.ppt

> HSLFException on a valid Powerpoint file
>
>     Key: TIKA-2157
>     URL: https://issues.apache.org/jira/browse/TIKA-2157
>     Project: Tika
>     Issue Type: Bug
>     Components: parser
>     Affects Versions: 1.13
>     Environment: Windows 7 x64, JVM 1.8.0_101
>     Reporter: Seva Alekseyev
>     Attachments: CRADA 2-09 K Subbarao.ppt
>
> On the attached Powerpoint file, which opens fine with Powerpoint, the Tika parser throws the following error:
>
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: incorrect data check
>     at org.apache.poi.hslf.blip.PICT.getData(PICT.java:120)
>     at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>     at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: incorrect data check
>     at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>     at java.io.FilterInputStream.read(FilterInputStream.java:107)
>     at org.apache.poi.hslf.blip.PICT.read(PICT.java:133)
>     at org.apache.poi.hslf.blip.PICT.getData(PICT.java:116)
>     ... 6 more
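For context on the TIKA-2157 trace: "incorrect data check" is the message zlib reports when inflated data does not match the checksum stored at the end of a compressed stream, and java.util.zip.InflaterInputStream surfaces it as a ZipException. The sketch below is only an illustration of how that condition can arise; it has nothing to do with POI's PICT code, the class and method names are made up, and it says nothing about what is actually wrong with the attached file. It compresses a few bytes, flips one bit in the trailing checksum, and tries to inflate the result.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;
import java.util.zip.ZipException;

class ChecksumMismatchDemo {
    // Compresses the input into a zlib stream (deflate data plus trailing Adler-32).
    static byte[] deflate(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(bos)) {
            out.write(raw);
        }
        return bos.toByteArray();
    }

    // Inflates the whole stream; returns null on success or the ZipException message on failure.
    static String inflateFailure(byte[] compressed) throws IOException {
        try (InflaterInputStream in =
                 new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[256];
            while (in.read(buf) != -1) {
                // discard the inflated bytes; only success or failure matters here
            }
            return null;
        } catch (ZipException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] good = deflate("picture bytes".getBytes(StandardCharsets.UTF_8));
        byte[] bad = good.clone();
        bad[bad.length - 1] ^= 0x01; // damage the last byte of the stored checksum

        System.out.println("intact stream:  " + inflateFailure(good));
        System.out.println("damaged stream: " + inflateFailure(bad));
    }
}
```

An application that tolerates such streams would have to decide whether to skip the picture or keep the bytes inflated so far; the sketch takes no position on that.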
[jira] [Updated] (TIKA-2155) IndexOutOfBoundsException on a valid Excel file
[ https://issues.apache.org/jira/browse/TIKA-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2155:
    Attachment: Copy of [corrupted Unicode text].xlsx
[jira] [Created] (TIKA-2155) IndexOutOfBoundsException on a valid Excel file
Seva Alekseyev created TIKA-2155:

    Summary: IndexOutOfBoundsException on a valid Excel file
    Key: TIKA-2155
    URL: https://issues.apache.org/jira/browse/TIKA-2155
    Project: Tika
    Issue Type: Bug
    Components: parser
    Affects Versions: 1.13
    Environment: Windows 7 x64, JVM 1.8.0_101
    Reporter: Seva Alekseyev
    Attachments: Copy of [corrupted Unicode text].xlsx

On the attached Excel file, which opens fine with Excel, the Tika parser throws the following error:

java.lang.IndexOutOfBoundsException: Index: 65535, Size: 251
    at java.util.ArrayList.rangeCheck(ArrayList.java:653)
    at java.util.ArrayList.get(ArrayList.java:429)
    at org.apache.poi.xssf.model.StylesTable.getStyleAt(StylesTable.java:421)
    at org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.startElement(XSSFSheetXMLHandler.java:281)
    at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.startElement(XSSFExcelExtractorDecorator.java:345)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(AbstractSAXParser.java:509)
    at com.sun.org.apache.xerces.internal.parsers.AbstractXMLDocumentParser.emptyElement(AbstractXMLDocumentParser.java:182)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:356)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2786)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
    at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processSheet(XSSFExcelExtractorDecorator.java:195)
    at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:138)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
    at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:97)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
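For context on the TIKA-2155 trace: the failing index, 65535, is 0xFFFF (the largest unsigned 16-bit value), and the list being read holds only 251 entries, so an unchecked ArrayList.get cannot succeed. The sketch below shows a bounds-checked lookup that returns null instead of throwing. It is not POI's StylesTable code: the helper name styleAtOrNull and the String element type are hypothetical stand-ins, and whether a null style is an acceptable fallback is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class GuardedStyleLookup {
    // Hypothetical helper: returns null for an out-of-range index instead of throwing.
    static <T> T styleAtOrNull(List<T> styles, int index) {
        if (index < 0 || index >= styles.size()) {
            return null;
        }
        return styles.get(index);
    }

    public static void main(String[] args) {
        // Stand-in for a style table with 251 entries, as in the trace.
        List<String> styles = new ArrayList<>();
        for (int i = 0; i < 251; i++) {
            styles.add("style-" + i);
        }

        System.out.println(styleAtOrNull(styles, 250));    // last valid entry
        System.out.println(styleAtOrNull(styles, 0xFFFF)); // out of range, so null

        try {
            styles.get(0xFFFF); // the unchecked call fails as in the report
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unchecked get threw IndexOutOfBoundsException");
        }
    }
}
```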
[jira] [Updated] (TIKA-2154) RecordFormatException on a valid Excel file
[ https://issues.apache.org/jira/browse/TIKA-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2154:
    Attachment: Interface_Availability.xls
[jira] [Created] (TIKA-2154) RecordFormatException on a valid Excel file
Seva Alekseyev created TIKA-2154:

    Summary: RecordFormatException on a valid Excel file
    Key: TIKA-2154
    URL: https://issues.apache.org/jira/browse/TIKA-2154
    Project: Tika
    Issue Type: Bug
    Components: parser
    Affects Versions: 1.13
    Environment: Windows 7 x64, JVM 1.8.0_101
    Reporter: Seva Alekseyev

On the attached XLS file, which opens fine with Excel, the Tika parser throws the following error:

org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance
    at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:98)
    at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:334)
    at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:308)
    at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:274)
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:155)
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:118)
    at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:309)
    at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:169)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:154)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.lang.IllegalStateException: Should never be called before end of current record
    at org.apache.poi.hssf.record.RecordInputStream.isContinueNext(RecordInputStream.java:455)
    at org.apache.poi.hssf.record.RecordInputStream.readStringCommon(RecordInputStream.java:386)
    at org.apache.poi.hssf.record.RecordInputStream.readUnicodeLEString(RecordInputStream.java:342)
    at org.apache.poi.hssf.record.FormatRecord.<init>(FormatRecord.java:57)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:84)
    ... 11 more
[jira] [Created] (TIKA-2153) TaggedIOException on a valid Powerpoint file
Seva Alekseyev created TIKA-2153:

    Summary: TaggedIOException on a valid Powerpoint file
    Key: TIKA-2153
    URL: https://issues.apache.org/jira/browse/TIKA-2153
    Project: Tika
    Issue Type: Bug
    Components: parser
    Affects Versions: 1.13
    Environment: Windows 7 x64, JVM 1.8.0_101
    Reporter: Seva Alekseyev

On the following Powerpoint file, which opens fine with Powerpoint:

https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx

the Tika parser throws the following error:

org.apache.tika.io.TaggedIOException: invalid stored block lengths
    at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
    at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
    at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
    at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78)
    ... 13 more
Caused by: java.util.zip.ZipException: invalid stored block lengths
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
    ... 19 more

Could be similar to #2130.
[jira] [Created] (TIKA-2152) NullPointerException on a valid Word file
Seva Alekseyev created TIKA-2152:

    Summary: NullPointerException on a valid Word file
    Key: TIKA-2152
    URL: https://issues.apache.org/jira/browse/TIKA-2152
    Project: Tika
    Issue Type: Bug
    Components: parser
    Affects Versions: 1.13
    Environment: Windows 7 x64, JVM 1.8.0_101
    Reporter: Seva Alekseyev
    Attachments: A5346.docx

On the attached Word document, which opens fine in Word, the Tika parser throws the following error:

java.lang.NullPointerException
    at org.apache.poi.xwpf.usermodel.XWPFStyles.getStyle(XWPFStyles.java:198)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:149)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:107)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractTable(XWPFWordExtractorDecorator.java:362)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractHeaderText(XWPFWordExtractorDecorator.java:414)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractHeaders(XWPFWordExtractorDecorator.java:404)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:89)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
[jira] [Updated] (TIKA-2152) NullPointerException on a valid Word file
[ https://issues.apache.org/jira/browse/TIKA-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2152:
    Attachment: A5346.docx
[jira] [Updated] (TIKA-2144) NullPointerException on a valid Word file
[ https://issues.apache.org/jira/browse/TIKA-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2144:
    Attachment: (was: Proposal ID 17 Offeror ChromoLogic.docx)
[jira] [Created] (TIKA-2147) ClassCastException on a valid Word template
Seva Alekseyev created TIKA-2147:

    Summary: ClassCastException on a valid Word template
    Key: TIKA-2147
    URL: https://issues.apache.org/jira/browse/TIKA-2147
    Project: Tika
    Issue Type: Bug
    Components: parser
    Affects Versions: 1.13
    Environment: Windows 7 x64, JVM 1.8.0_101
    Reporter: Seva Alekseyev
    Attachments: Forefront Fax.dotx

On the attached document template, which opens fine in Word, the Tika parser throws the following error:

java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument
    at org.apache.poi.xwpf.usermodel.XWPFFootnotes.getXWPFDocument(XWPFFootnotes.java:162)
    at org.apache.poi.xwpf.usermodel.XWPFFootnote.<init>(XWPFFootnote.java:47)
    at org.apache.poi.xwpf.usermodel.XWPFFootnotes.onDocumentRead(XWPFFootnotes.java:95)
    at org.apache.poi.POIXMLDocumentPart._invokeOnDocumentRead(POIXMLDocumentPart.java:658)
    at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:235)
    at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
    at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
    at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
[jira] [Updated] (TIKA-2147) ClassCastException on a valid Word template
[ https://issues.apache.org/jira/browse/TIKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2147:
    Attachment: Forefront Fax.dotx
[jira] [Commented] (TIKA-2144) NullPointerException on a valid Word file
[ https://issues.apache.org/jira/browse/TIKA-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15611753#comment-15611753 ]

Seva Alekseyev commented on TIKA-2144:

No idea. I was given a huge library of documents (Office and PDF) and told to implement full text search. I might or might not be able to track down the author, but that's irrelevant. If it opens in Word, it's a valid document, ergo it should be in my index.
[jira] [Created] (TIKA-2144) NullPointerException on a valid Word file
Seva Alekseyev created TIKA-2144:

    Summary: NullPointerException on a valid Word file
    Key: TIKA-2144
    URL: https://issues.apache.org/jira/browse/TIKA-2144
    Project: Tika
    Issue Type: Bug
    Components: parser
    Affects Versions: 1.13
    Environment: Windows 7 x64, JVM 1.8.0_101
    Reporter: Seva Alekseyev

On the attached Word file, which opens fine in Word, the Tika parser throws the following error:

java.lang.NullPointerException
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:149)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:107)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:93)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
[jira] [Created] (TIKA-2145) InvalidFormatException on a valid Word file
Seva Alekseyev created TIKA-2145:

    Summary: InvalidFormatException on a valid Word file
    Key: TIKA-2145
    URL: https://issues.apache.org/jira/browse/TIKA-2145
    Project: Tika
    Issue Type: Bug
    Components: parser
    Affects Versions: 1.13
    Environment: Windows 7 x64, JVM 1.8.0_101
    Reporter: Seva Alekseyev
    Attachments: safety_analysis_report_FINAL2.docx

On the attached Word file, which opens fine with Word, the Tika parser throws the following exception:

org.apache.tika.exception.TikaException: Error creating OOXML extractor
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: java.lang.IllegalArgumentException: Date for created could not be parsed: 2015-07-27
    at org.apache.poi.openxml4j.opc.internal.PackagePropertiesPart.setCreatedProperty(PackagePropertiesPart.java:408)
    at org.apache.poi.openxml4j.opc.internal.unmarshallers.PackagePropertiesUnmarshaller.unmarshall(PackagePropertiesUnmarshaller.java:124)
    at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:743)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:230)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:69)
    ... 3 more
Caused by: org.apache.poi.openxml4j.exceptions.InvalidFormatException: Date 2015-07-27 not well formatted, expected format in: yyyy-MM-dd'T'HH:mm:ssz, yyyy-MM-dd'T'HH:mm:ss.SSSz, yyyy-MM-dd'T'HH:mm:ss'Z', yyyy-MM-dd'T'HH:mm:ss.SS'Z'
    at org.apache.poi.openxml4j.opc.internal.PackagePropertiesPart.setDateValue(PackagePropertiesPart.java:615)
    at org.apache.poi.openxml4j.opc.internal.PackagePropertiesPart.setCreatedProperty(PackagePropertiesPart.java:406)
    ... 7 more
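For context on the TIKA-2145 trace: the "created" property in the file is the date-only value 2015-07-27, while every pattern named in the exception message requires a time part after a literal T. The sketch below reproduces that mismatch with java.text.SimpleDateFormat. It is not POI's PackagePropertiesPart code: the class and method names are made up, the patterns are taken here to begin with a yyyy year field, and the date-only fallback pattern is a hypothetical workaround rather than anything the library is known to do.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class CreatedDateCheck {
    // The four timestamp patterns named in the exception message
    // (leading "yyyy" year field is an assumption of this sketch).
    static final String[] STRICT_PATTERNS = {
        "yyyy-MM-dd'T'HH:mm:ssz",
        "yyyy-MM-dd'T'HH:mm:ss.SSSz",
        "yyyy-MM-dd'T'HH:mm:ss'Z'",
        "yyyy-MM-dd'T'HH:mm:ss.SS'Z'"
    };

    // Returns the first parse that consumes the whole string, or null if no pattern fits.
    static Date parseWith(String[] patterns, String value) {
        for (String pattern : patterns) {
            SimpleDateFormat fmt = new SimpleDateFormat(pattern, Locale.ROOT);
            fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
            fmt.setLenient(false);
            ParsePosition pos = new ParsePosition(0);
            Date parsed = fmt.parse(value, pos);
            if (parsed != null && pos.getIndex() == value.length()) {
                return parsed;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String created = "2015-07-27"; // value quoted in the exception message

        // No timestamp pattern accepts a date-only value.
        System.out.println(parseWith(STRICT_PATTERNS, created) == null);                   // true

        // Hypothetical fallback: a date-only pattern does accept it.
        System.out.println(parseWith(new String[] {"yyyy-MM-dd"}, created) != null);       // true
    }
}
```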
[jira] [Updated] (TIKA-2145) InvalidFormatException on a valid Word file
[ https://issues.apache.org/jira/browse/TIKA-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2145:
---------------------------------
    Attachment: safety_analysis_report_FINAL2.docx

> InvalidFormatException on a valid Word file
> -------------------------------------------
>
>                 Key: TIKA-2145
>                 URL: https://issues.apache.org/jira/browse/TIKA-2145
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.13
>         Environment: Windows 7 x64, JVM 1.8.0_101
>            Reporter: Seva Alekseyev
>         Attachments: safety_analysis_report_FINAL2.docx
>
>
> On the attached Word file, which opens fine with Word, the Tika parser throws the following exception:
> org.apache.tika.exception.TikaException: Error creating OOXML extractor
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> Caused by: java.lang.IllegalArgumentException: Date for created could not be parsed: 2015-07-27
>     at org.apache.poi.openxml4j.opc.internal.PackagePropertiesPart.setCreatedProperty(PackagePropertiesPart.java:408)
>     at org.apache.poi.openxml4j.opc.internal.unmarshallers.PackagePropertiesUnmarshaller.unmarshall(PackagePropertiesUnmarshaller.java:124)
>     at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:743)
>     at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:230)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:69)
>     ... 3 more
> Caused by: org.apache.poi.openxml4j.exceptions.InvalidFormatException: Date 2015-07-27 not well formatted, expected format in: yyyy-MM-dd'T'HH:mm:ssz, yyyy-MM-dd'T'HH:mm:ss.SSSz, yyyy-MM-dd'T'HH:mm:ss'Z', yyyy-MM-dd'T'HH:mm:ss.SS'Z'
>     at org.apache.poi.openxml4j.opc.internal.PackagePropertiesPart.setDateValue(PackagePropertiesPart.java:615)
>     at org.apache.poi.openxml4j.opc.internal.PackagePropertiesPart.setCreatedProperty(PackagePropertiesPart.java:406)
>     ... 7 more

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
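
The InvalidFormatException in TIKA-2145 can be reproduced outside Tika and POI with plain `java.text.SimpleDateFormat`. The following is only a sketch, not POI's actual code: the class and method names are hypothetical, the four patterns are copied from the exception message, and the leading `yyyy` year field in each pattern is an assumption. It shows that a date-only value such as 2015-07-27 matches none of the listed patterns, while a hypothetical extra `yyyy-MM-dd` pattern would accept it.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of Tika or POI.
class DateOnlyCheck {
    // Patterns as listed in the exception message (year field assumed to be yyyy).
    static final String[] PATTERNS = {
        "yyyy-MM-dd'T'HH:mm:ssz",
        "yyyy-MM-dd'T'HH:mm:ss.SSSz",
        "yyyy-MM-dd'T'HH:mm:ss'Z'",
        "yyyy-MM-dd'T'HH:mm:ss.SS'Z'"
    };

    // Returns true if the whole value parses strictly with at least one pattern.
    static boolean parsesWithAny(String value, String... patterns) {
        for (String pattern : patterns) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
            format.setTimeZone(TimeZone.getTimeZone("UTC"));
            format.setLenient(false);
            ParsePosition pos = new ParsePosition(0);
            // parse(String, ParsePosition) returns null on failure instead of throwing.
            if (format.parse(value, pos) != null && pos.getIndex() == value.length()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(parsesWithAny("2015-07-27", PATTERNS));           // false
        System.out.println(parsesWithAny("2015-07-27", "yyyy-MM-dd"));       // true
        System.out.println(parsesWithAny("2015-07-27T00:00:00Z", PATTERNS)); // true
    }
}
```

Every listed pattern requires a literal `T` and a time part after the day, so parsing stops at the end of the date-only string and fails.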
[jira] [Updated] (TIKA-2144) NullPointerException on a valid Word file
[ https://issues.apache.org/jira/browse/TIKA-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2144:
---------------------------------
    Attachment: Proposal ID 17 Offeror ChromoLogic.docx

> NullPointerException on a valid Word file
> -----------------------------------------
>
>                 Key: TIKA-2144
>                 URL: https://issues.apache.org/jira/browse/TIKA-2144
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.13
>         Environment: Windows 7 x64, JVM 1.8.0_101
>            Reporter: Seva Alekseyev
>         Attachments: Proposal ID 17 Offeror ChromoLogic.docx
>
>
> On the attached Word file, which opens fine in Word, the Tika parser throws the following error:
> java.lang.NullPointerException
>     at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:149)
>     at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:107)
>     at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:93)
>     at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
[jira] [Updated] (TIKA-2142) ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/TIKA-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2142:
---------------------------------
    Attachment: HPV8dHinge Confocal Results.ppt

> ArrayIndexOutOfBoundsException
> ------------------------------
>
>                 Key: TIKA-2142
>                 URL: https://issues.apache.org/jira/browse/TIKA-2142
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.13
>         Environment: Windows 7 x64, JVM 1.8.0_101
>            Reporter: Seva Alekseyev
>         Attachments: HPV8dHinge Confocal Results.ppt
>
>
> On the attached PowerPoint presentation, which opens fine with PowerPoint, the Tika parser throws the following error:
> java.lang.ArrayIndexOutOfBoundsException
>     at java.lang.System.arraycopy(Native Method)
>     at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.readPictures(HSLFSlideShowImpl.java:438)
>     at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.getPictureData(HSLFSlideShowImpl.java:772)
>     at org.apache.poi.hslf.usermodel.HSLFSlideShow.getPictureData(HSLFSlideShow.java:547)
>     at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:305)
>     at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
[jira] [Updated] (TIKA-2141) MalformedByteSequenceException on a valid Excel file
[ https://issues.apache.org/jira/browse/TIKA-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2141:
---------------------------------
    Attachment: Freezer1.xlsx

> MalformedByteSequenceException on a valid Excel file
> ----------------------------------------------------
>
>                 Key: TIKA-2141
>                 URL: https://issues.apache.org/jira/browse/TIKA-2141
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.13
>         Environment: Windows 7 x64, JVM 1.8.0_101
>            Reporter: Seva Alekseyev
>         Attachments: Freezer1.xlsx
>
>
> On the attached XLSX file, which opens fine in Excel, the Tika parser throws the following error:
> com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 3 of 3-byte UTF-8 sequence.
>     at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(Unknown Source)
>     at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(Unknown Source)
>     at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(Unknown Source)
>     at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.scanLiteral(Unknown Source)
>     at com.sun.org.apache.xerces.internal.impl.XMLScanner.scanAttributeValue(Unknown Source)
>     at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanAttribute(Unknown Source)
>     at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
>     at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(Unknown Source)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
>     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
>     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
>     at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
>     at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
>     at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>     at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
>     at org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:137)
>     at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:116)
>     at org.openxmlformats.schemas.drawingml.x2006.main.ThemeDocument$Factory.parse(Unknown Source)
>     at org.apache.poi.xssf.model.ThemesTable.<init>(ThemesTable.java:85)
>     at org.apache.poi.xssf.model.ThemesTable.<init>(ThemesTable.java:96)
>     at org.apache.poi.xssf.eventusermodel.XSSFReader.getStylesTable(XSSFReader.java:111)
>     at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:114)
>     at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>     at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:97)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
[jira] [Created] (TIKA-2141) MalformedByteSequenceException on a valid Excel file
Seva Alekseyev created TIKA-2141:

             Summary: MalformedByteSequenceException on a valid Excel file
                 Key: TIKA-2141
                 URL: https://issues.apache.org/jira/browse/TIKA-2141
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.13
         Environment: Windows 7 x64, JVM 1.8.0_101
            Reporter: Seva Alekseyev
         Attachments: Freezer1.xlsx

On the attached XLSX file, which opens fine in Excel, the Tika parser throws the following error:
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 3 of 3-byte UTF-8 sequence.
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.scanLiteral(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLScanner.scanAttributeValue(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanAttribute(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
    at org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:137)
    at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:116)
    at org.openxmlformats.schemas.drawingml.x2006.main.ThemeDocument$Factory.parse(Unknown Source)
    at org.apache.poi.xssf.model.ThemesTable.<init>(ThemesTable.java:85)
    at org.apache.poi.xssf.model.ThemesTable.<init>(ThemesTable.java:96)
    at org.apache.poi.xssf.eventusermodel.XSSFReader.getStylesTable(XSSFReader.java:111)
    at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:114)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
    at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:97)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
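
The message "Invalid byte 3 of 3-byte UTF-8 sequence" in TIKA-2141 means that a lead byte announcing a three-byte character was not followed by two valid continuation bytes. One way to check a part extracted from the .xlsx for this, independent of Tika, POI and Xerces, might be a strict decode with the JDK's `CharsetDecoder`. This is a sketch only: the class and method names are hypothetical, and the sample bytes are made up for illustration, not taken from Freezer1.xlsx.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika or POI.
class StrictUtf8Check {
    // Returns true only if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isWellFormedUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)      // throw instead of substituting U+FFFD
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 0xE2 0x82 0xAC is the euro sign, a valid three-byte sequence.
        byte[] good = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC};
        // Same lead bytes, but the third byte (0x41, 'A') is not a continuation byte.
        byte[] bad = {(byte) 0xE2, (byte) 0x82, (byte) 0x41};
        System.out.println(isWellFormedUtf8(good)); // true
        System.out.println(isWellFormedUtf8(bad));  // false
    }
}
```

Whether the theme part in the attached file actually contains such a sequence, or is encoded in a different charset than declared, is not established by the report.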
[jira] [Created] (TIKA-2140) ClassCastException on a valid PDF
Seva Alekseyev created TIKA-2140:

             Summary: ClassCastException on a valid PDF
                 Key: TIKA-2140
                 URL: https://issues.apache.org/jira/browse/TIKA-2140
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.13
         Environment: Windows 7 x64, JVM 1.8.0_101
            Reporter: Seva Alekseyev

On the following PDF file, which opens fine in Adobe Reader:
https://dl.dropboxusercontent.com/u/92341073/FDA%20Submission%2096%20Vol.%20III.pdf
the Tika parser throws the following error:
java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSDictionary
    at org.apache.pdfbox.pdmodel.PDPageTree.getKids(PDPageTree.java:144)
    at org.apache.pdfbox.pdmodel.PDPageTree.access$200(PDPageTree.java:38)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:166)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.<init>(PDPageTree.java:159)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.<init>(PDPageTree.java:153)
    at org.apache.pdfbox.pdmodel.PDPageTree.iterator(PDPageTree.java:123)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:314)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:160)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:144)

Before that, PDFBox throws some warnings:
21 Oct 2016 11:46:35 WARN BaseParser - Invalid dictionary, found: '?' but expected: '/' at offset 22061056
21 Oct 2016 11:46:36 WARN BaseParser - Invalid dictionary, found: '?' but expected: '/' at offset 22061056
21 Oct 2016 11:46:36 WARN COSParser - Object (3:0) at offset 22059324 does not end with 'endobj' but with ''

So the file is somewhat malformed, but not to the point of unreadability.