[jira] [Updated] (TIKA-2217) RuntimeException on a PPT with a movie

2016-12-19 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2217:
-
Description: 
https://dl.dropboxusercontent.com/u/92341073/lecture%20WH%202002.ppt

java.lang.RuntimeException for : "Couldn't instantiate the 
class for type with id 1000 on class class org.apache.poi.hslf.record.Document 
: java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 1033 on class class org.apache.poi.hslf.record.ExObjList : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 1"
java.lang.RuntimeException: Couldn't instantiate the class for type with id 
1000 on class class org.apache.poi.hslf.record.Document : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 1033 on class class org.apache.poi.hslf.record.ExObjList : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.poi.hslf.record.Record.createRecordForType:185
at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>:161
at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>:154
at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65
at org.apache.tika.parser.microsoft.OfficeParser.parse:172
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.reflect.InvocationTargetException: 
at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
at java.lang.reflect.Constructor.newInstance:-1
at org.apache.poi.hslf.record.Record.createRecordForType:181
at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>:161
at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>:154
at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65
at org.apache.tika.parser.microsoft.OfficeParser.parse:172
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 1033 on class class org.apache.poi.hslf.record.ExObjList : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.poi.hslf.record.Record.createRecordForType:185
at org.apache.poi.hslf.record.Record.findChildRecords:128
at org.apache.poi.hslf.record.Document.<init>:133
at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
at java.lang.reflect.Constructor.newInstance:-1
at org.apache.poi.hslf.record.Record.createRecordForType:181
at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>:161
at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>:154
at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65
at org.apache.tika.parser.microsoft.OfficeParser.parse:172
at 

[jira] [Updated] (TIKA-2217) RuntimeException on a PPT with a movie

2016-12-19 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2217:
-
Description: 
https://dl.dropboxusercontent.com/u/92341073/lecture%20WH%202002.ppt

java.lang.RuntimeException: Couldn't instantiate the class for type with id 
1000 on class class org.apache.poi.hslf.record.Document : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 1033 on class class org.apache.poi.hslf.record.ExObjList : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.poi.hslf.record.Record.createRecordForType:185
at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>:161
at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>:154
at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65
at org.apache.tika.parser.microsoft.OfficeParser.parse:172
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.reflect.InvocationTargetException: 
at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
at java.lang.reflect.Constructor.newInstance:-1
at org.apache.poi.hslf.record.Record.createRecordForType:181
at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>:161
at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>:154
at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65
at org.apache.tika.parser.microsoft.OfficeParser.parse:172
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 1033 on class class org.apache.poi.hslf.record.ExObjList : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.poi.hslf.record.Record.createRecordForType:185
at org.apache.poi.hslf.record.Record.findChildRecords:128
at org.apache.poi.hslf.record.Document.<init>:133
at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
at java.lang.reflect.Constructor.newInstance:-1
at org.apache.poi.hslf.record.Record.createRecordForType:181
at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>:161
at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>:154
at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65
at org.apache.tika.parser.microsoft.OfficeParser.parse:172
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.reflect.InvocationTargetException: 
at sun.reflect.GeneratedConstructorAccessor47.newInstance:-1
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
at java.lang.reflect.Constructor.newInstance:-1
at org.apache.poi.hslf.record.Record.createRecordForType:181
at org.apache.poi.hslf.record.Record.findChildRecords:128
at org.apache.poi.hslf.record.Document.<init>:133
at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
at java.lang.reflect.Constructor.newInstance:-1
at org.apache.poi.hslf.record.Record.createRecordForType:181
at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
at 

[jira] [Updated] (TIKA-2216) ArrayIndexOutOfBoundsException on a valid Word file

2016-12-19 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2216:
-
Description: 
https://dl.dropboxusercontent.com/u/92341073/lecture%20WH%202002.ppt

java.lang.ArrayIndexOutOfBoundsException: 
at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
at org.apache.poi.hwpf.HWPFOldDocument.<init>:132
at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
at org.apache.tika.parser.microsoft.WordExtractor.parse:153
at org.apache.tika.parser.microsoft.OfficeParser.parse:169
at org.apache.tika.parser.microsoft.OfficeParser.parse:130

  was:
java.lang.ArrayIndexOutOfBoundsException: 
at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
at org.apache.poi.hwpf.HWPFOldDocument.<init>:132
at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
at org.apache.tika.parser.microsoft.WordExtractor.parse:153
at org.apache.tika.parser.microsoft.OfficeParser.parse:169
at org.apache.tika.parser.microsoft.OfficeParser.parse:130


> ArrayIndexOutOfBoundsException on a valid Word file
> ---
>
> Key: TIKA-2216
> URL: https://issues.apache.org/jira/browse/TIKA-2216
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: TB Coord RFCb.doc
>
>
> https://dl.dropboxusercontent.com/u/92341073/lecture%20WH%202002.ppt
> java.lang.ArrayIndexOutOfBoundsException: 
>   at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
>   at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
>   at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
>   at org.apache.poi.hwpf.HWPFOldDocument.<init>:132
>   at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
>   at org.apache.tika.parser.microsoft.WordExtractor.parse:153
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:169
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-2216) ArrayIndexOutOfBoundsException on a valid Word file

2016-12-19 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2216:
-
Description: 
java.lang.ArrayIndexOutOfBoundsException: 
at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
at org.apache.poi.hwpf.HWPFOldDocument.<init>:132
at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
at org.apache.tika.parser.microsoft.WordExtractor.parse:153
at org.apache.tika.parser.microsoft.OfficeParser.parse:169
at org.apache.tika.parser.microsoft.OfficeParser.parse:130

  was:
https://dl.dropboxusercontent.com/u/92341073/lecture%20WH%202002.ppt

java.lang.ArrayIndexOutOfBoundsException: 
at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
at org.apache.poi.hwpf.HWPFOldDocument.<init>:132
at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
at org.apache.tika.parser.microsoft.WordExtractor.parse:153
at org.apache.tika.parser.microsoft.OfficeParser.parse:169
at org.apache.tika.parser.microsoft.OfficeParser.parse:130


> ArrayIndexOutOfBoundsException on a valid Word file
> ---
>
> Key: TIKA-2216
> URL: https://issues.apache.org/jira/browse/TIKA-2216
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: TB Coord RFCb.doc
>
>
> java.lang.ArrayIndexOutOfBoundsException: 
>   at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
>   at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
>   at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
>   at org.apache.poi.hwpf.HWPFOldDocument.<init>:132
>   at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
>   at org.apache.tika.parser.microsoft.WordExtractor.parse:153
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:169
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Created] (TIKA-2217) RuntimeException on a PPT with a movie

2016-12-19 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2217:


 Summary: RuntimeException on a PPT with a movie
 Key: TIKA-2217
 URL: https://issues.apache.org/jira/browse/TIKA-2217
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


java.lang.RuntimeException for 
63933/<\\ai-storm\FScan\Scan_2016-12-16_01-06-55\Folders\75457622\lecture WH 
2002.ppt>: "Couldn't instantiate the class for type with id 1000 on class class 
org.apache.poi.hslf.record.Document : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 1033 on class class org.apache.poi.hslf.record.ExObjList : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 1"
java.lang.RuntimeException: Couldn't instantiate the class for type with id 
1000 on class class org.apache.poi.hslf.record.Document : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 1033 on class class org.apache.poi.hslf.record.ExObjList : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.poi.hslf.record.Record.createRecordForType:185
at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>:161
at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>:154
at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65
at org.apache.tika.parser.microsoft.OfficeParser.parse:172
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.reflect.InvocationTargetException: 
at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
at java.lang.reflect.Constructor.newInstance:-1
at org.apache.poi.hslf.record.Record.createRecordForType:181
at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>:161
at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>:154
at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65
at org.apache.tika.parser.microsoft.OfficeParser.parse:172
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 1033 on class class org.apache.poi.hslf.record.ExObjList : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.poi.hslf.record.Record.createRecordForType:185
at org.apache.poi.hslf.record.Record.findChildRecords:128
at org.apache.poi.hslf.record.Document.<init>:133
at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
at java.lang.reflect.Constructor.newInstance:-1
at org.apache.poi.hslf.record.Record.createRecordForType:181
at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
at 
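
The trace above nests several RuntimeException and InvocationTargetException layers around a single ArrayIndexOutOfBoundsException. For triage it can help to unwrap the chain programmatically. The following is a minimal sketch using only the JDK; the class name RootCauseSketch, the method rootCause, and the depth limit of 64 are illustrative assumptions, not part of Tika or POI.

```java
import java.lang.reflect.InvocationTargetException;

// Hypothetical helper, not a Tika or POI class.
class RootCauseSketch {

    // Follows getCause() until no further cause is available.
    // The depth limit (an arbitrary assumption) guards against cyclic chains.
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        for (int depth = 0; depth < 64; depth++) {
            Throwable next = current.getCause();
            if (next == null || next == current) {
                break;
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        // Rebuild a chain shaped like the one in the report (messages shortened).
        Throwable inner = new ArrayIndexOutOfBoundsException("1");
        Throwable chain = new RuntimeException("type with id 1000",
                new InvocationTargetException(
                        new RuntimeException("type with id 1033",
                                new InvocationTargetException(inner))));
        // prints "java.lang.ArrayIndexOutOfBoundsException: 1"
        System.out.println(rootCause(chain));
    }
}
```

InvocationTargetException.getCause() returns the exception thrown by the reflective constructor call, so the same loop handles both wrapper types.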

[jira] [Updated] (TIKA-2216) ArrayIndexOutOfBoundsException on a valid Word file

2016-12-19 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2216:
-
Attachment: TB Coord RFCb.doc

> ArrayIndexOutOfBoundsException on a valid Word file
> ---
>
> Key: TIKA-2216
> URL: https://issues.apache.org/jira/browse/TIKA-2216
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: TB Coord RFCb.doc
>
>
> java.lang.ArrayIndexOutOfBoundsException: 
>   at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
>   at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
>   at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
>   at org.apache.poi.hwpf.HWPFOldDocument.<init>:132
>   at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
>   at org.apache.tika.parser.microsoft.WordExtractor.parse:153
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:169
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Created] (TIKA-2216) ArrayIndexOutOfBoundsException on a valid Word file

2016-12-19 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2216:


 Summary: ArrayIndexOutOfBoundsException on a valid Word file
 Key: TIKA-2216
 URL: https://issues.apache.org/jira/browse/TIKA-2216
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


java.lang.ArrayIndexOutOfBoundsException: 
at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
at org.apache.poi.hwpf.HWPFOldDocument.<init>:132
at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
at org.apache.tika.parser.microsoft.WordExtractor.parse:153
at org.apache.tika.parser.microsoft.OfficeParser.parse:169
at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Created] (TIKA-2215) TikaException about "Invalid embedded resource" on a valid PPT file

2016-12-19 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2215:


 Summary: TikaException about "Invalid embedded resource" on a 
valid PPT file
 Key: TIKA-2215
 URL: https://issues.apache.org/jira/browse/TIKA-2215
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
 Attachments: Iverson.ppt

On the attached file, which opens with PowerPoint, the Tika parser throws the 
following error:

org.apache.tika.exception.TikaException: Invalid embedded resource
at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc:243
at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources:390
at org.apache.tika.parser.microsoft.HSLFExtractor.parse:142
at org.apache.tika.parser.microsoft.OfficeParser.parse:172
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.IndexOutOfBoundsException: Block 32630271 not found
at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt:486
at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:169
at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:142
at org.apache.poi.poifs.filesystem.NDocumentInputStream.readFully:248
at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:165
at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:160
at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc:226
at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources:390
at org.apache.tika.parser.microsoft.HSLFExtractor.parse:142
at org.apache.tika.parser.microsoft.OfficeParser.parse:172
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.IndexOutOfBoundsException: Unable to read 512 bytes from 16706699264 in stream of length 164352
at org.apache.poi.poifs.nio.ByteArrayBackedDataSource.read:42
at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt:484
at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:169
at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:142
at org.apache.poi.poifs.filesystem.NDocumentInputStream.readFully:248
at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:165
at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:160
at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc:226
at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources:390
at org.apache.tika.parser.microsoft.HSLFExtractor.parse:142
at org.apache.tika.parser.microsoft.OfficeParser.parse:172
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
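
The two numbers in the nested causes above agree with each other. Assuming 512-byte blocks preceded by one 512-byte header block (the usual OLE2 compound file layout), block 32630271 would start at (32630271 + 1) * 512 = 16706699264, far past the end of a 164352-byte stream. The sketch below only reproduces that arithmetic; BlockOffsetSketch and its methods are hypothetical names, not POI APIs.

```java
// Hypothetical helper, not a POI class.
class BlockOffsetSketch {

    // Assumption: block n starts after a single header block of the same size.
    static long blockOffset(long blockIndex, int blockSize) {
        return (blockIndex + 1L) * blockSize;
    }

    // True only if the whole block lies inside a stream of the given length.
    static boolean fitsInStream(long blockIndex, int blockSize, long streamLength) {
        return blockOffset(blockIndex, blockSize) + blockSize <= streamLength;
    }

    public static void main(String[] args) {
        System.out.println(blockOffset(32630271L, 512));            // 16706699264
        System.out.println(fitsInStream(32630271L, 512, 164352L));  // false
    }
}
```

Under the same assumption, a 164352-byte stream holds blocks 0 through 319, so an index in the tens of millions suggests the block pointer itself was read incorrectly.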





[jira] [Updated] (TIKA-2215) TikaException about "Invalid embedded resource" on a valid PPT file

2016-12-19 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2215:
-
Attachment: Iverson.ppt

> TikaException about "Invalid embedded resource" on a valid PPT file
> ---
>
> Key: TIKA-2215
> URL: https://issues.apache.org/jira/browse/TIKA-2215
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Iverson.ppt
>
>
> On the attached file, which opens with PowerPoint, the Tika parser throws the 
> following error:
> org.apache.tika.exception.TikaException: Invalid embedded resource
>   at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc:243
>   at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources:390
>   at org.apache.tika.parser.microsoft.HSLFExtractor.parse:142
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:172
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130
> Caused by: java.lang.IndexOutOfBoundsException: Block 32630271 not found
>   at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt:486
>   at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:169
>   at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:142
>   at org.apache.poi.poifs.filesystem.NDocumentInputStream.readFully:248
>   at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:165
>   at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:160
>   at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc:226
>   at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources:390
>   at org.apache.tika.parser.microsoft.HSLFExtractor.parse:142
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:172
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130
> Caused by: java.lang.IndexOutOfBoundsException: Unable to read 512 bytes from 16706699264 in stream of length 164352
>   at org.apache.poi.poifs.nio.ByteArrayBackedDataSource.read:42
>   at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt:484
>   at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:169
>   at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:142
>   at org.apache.poi.poifs.filesystem.NDocumentInputStream.readFully:248
>   at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:165
>   at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:160
>   at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc:226
>   at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources:390
>   at org.apache.tika.parser.microsoft.HSLFExtractor.parse:142
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:172
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Created] (TIKA-2214) ArrayIndexOutOfBoundsException on a valid Word file

2016-12-19 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2214:


 Summary: ArrayIndexOutOfBoundsException on a valid Word file
 Key: TIKA-2214
 URL: https://issues.apache.org/jira/browse/TIKA-2214
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
 Attachments: NONCONT.DOC

On the attached file, which opens with Word, the Tika parser throws the 
following error:

java.lang.ArrayIndexOutOfBoundsException: 
at java.lang.System.arraycopy:-2
at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl:171
at org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>:101
at org.apache.poi.hwpf.model.OldPAPBinTable.<init>:49
at org.apache.poi.hwpf.HWPFOldDocument.<init>:105
at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
at org.apache.tika.parser.microsoft.WordExtractor.parse:153
at org.apache.tika.parser.microsoft.OfficeParser.parse:169
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
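
The top frame above is System.arraycopy, which throws when the requested range does not fit in the source or destination array. When the offset and length come from file data, one common defensive pattern is to validate the range first and fail with a descriptive message. The following is a minimal sketch of that pattern under those assumptions; GuardedCopySketch and copyRange are hypothetical names, and this is not how POI implements getGrpprl.

```java
// Hypothetical helper, not a POI class.
class GuardedCopySketch {

    // Copies length bytes starting at offset, rejecting ranges outside src.
    static byte[] copyRange(byte[] src, int offset, int length) {
        // "offset > src.length - length" avoids int overflow in offset + length.
        if (offset < 0 || length < 0 || offset > src.length - length) {
            throw new IllegalArgumentException(
                    "range at offset " + offset + " with length " + length
                            + " exceeds buffer of " + src.length + " bytes");
        }
        byte[] dest = new byte[length];
        System.arraycopy(src, offset, dest, 0, length);
        return dest;
    }

    public static void main(String[] args) {
        byte[] page = {1, 2, 3, 4};
        System.out.println(copyRange(page, 1, 2).length); // 2
        try {
            copyRange(page, 3, 5);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```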





[jira] [Updated] (TIKA-2214) ArrayIndexOutOfBoundsException on a valid Word file

2016-12-19 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2214:
-
Attachment: NONCONT.DOC

> ArrayIndexOutOfBoundsException on a valid Word file
> ---
>
> Key: TIKA-2214
> URL: https://issues.apache.org/jira/browse/TIKA-2214
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: NONCONT.DOC
>
>
> On the attached file, which opens with Word, the Tika parser throws the 
> following error:
> java.lang.ArrayIndexOutOfBoundsException: 
>   at java.lang.System.arraycopy:-2
>   at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl:171
>   at org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>:101
>   at org.apache.poi.hwpf.model.OldPAPBinTable.<init>:49
>   at org.apache.poi.hwpf.HWPFOldDocument.<init>:105
>   at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
>   at org.apache.tika.parser.microsoft.WordExtractor.parse:153
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:169
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Updated] (TIKA-2213) ArrayIndexOutOfBoundsException on a valid Word file

2016-12-19 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2213:
-
Attachment: biennial - 96.doc

> ArrayIndexOutOfBoundsException on a valid Word file
> ---
>
> Key: TIKA-2213
> URL: https://issues.apache.org/jira/browse/TIKA-2213
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: biennial - 96.doc
>
>
> On the attached file, which opens in Word, the Tika parser throws the
> following error:
> java.lang.ArrayIndexOutOfBoundsException: 
>   at java.lang.System.arraycopy:-2
>   at org.apache.poi.hwpf.model.TextPieceTable.<init>:109
>   at org.apache.poi.hwpf.model.ComplexFileTable.<init>:70
>   at org.apache.poi.hwpf.HWPFOldDocument.<init>:68
>   at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
>   at org.apache.tika.parser.microsoft.WordExtractor.parse:153
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:169
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Created] (TIKA-2213) ArrayIndexOutOfBoundsException on a valid Word file

2016-12-19 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2213:


 Summary: ArrayIndexOutOfBoundsException on a valid Word file
 Key: TIKA-2213
 URL: https://issues.apache.org/jira/browse/TIKA-2213
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


On the attached file, which opens in Word, the Tika parser throws the
following error:

java.lang.ArrayIndexOutOfBoundsException: 
at java.lang.System.arraycopy:-2
at org.apache.poi.hwpf.model.TextPieceTable.<init>:109
at org.apache.poi.hwpf.model.ComplexFileTable.<init>:70
at org.apache.poi.hwpf.HWPFOldDocument.<init>:68
at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
at org.apache.tika.parser.microsoft.WordExtractor.parse:153
at org.apache.tika.parser.microsoft.OfficeParser.parse:169
at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Created] (TIKA-2207) ArrayIndexOutOfBoundsException on a valid Excel file

2016-12-13 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2207:


 Summary: ArrayIndexOutOfBoundsException on a valid Excel file
 Key: TIKA-2207
 URL: https://issues.apache.org/jira/browse/TIKA-2207
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
 Attachments: Merck 9333 MPS 9-22-16.xlsx

The attached file, which opens in Excel, errors out in Tika:

java.lang.ArrayIndexOutOfBoundsException: 32
at 
org.apache.commons.compress.compressors.lzw.LZWInputStream.initializeTables:126
at 
org.apache.commons.compress.compressors.z.ZCompressorInputStream.<init>:54
at 
org.apache.commons.compress.compressors.CompressorStreamFactory.createCompressorInputStream:237
at 
org.apache.tika.parser.pkg.ZipContainerDetector.detectCompressorFormat:109
at org.apache.tika.parser.pkg.ZipContainerDetector.detect:95
at org.apache.tika.detect.CompositeDetector.detect:77
at org.apache.tika.parser.AutoDetectParser.parse:112
at org.apache.tika.parser.DelegatingParser.parse:72
at 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded:102
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE:245
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts:197
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML:115
at 
org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML:105
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:112
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87





[jira] [Updated] (TIKA-2207) ArrayIndexOutOfBoundsException on a valid Excel file

2016-12-13 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2207:
-
Attachment: Merck 9333 MPS 9-22-16.xlsx

> ArrayIndexOutOfBoundsException on a valid Excel file
> 
>
> Key: TIKA-2207
> URL: https://issues.apache.org/jira/browse/TIKA-2207
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Merck 9333 MPS 9-22-16.xlsx
>
>
> The attached file, which opens in Excel, errors out in Tika:
> java.lang.ArrayIndexOutOfBoundsException: 32
>   at 
> org.apache.commons.compress.compressors.lzw.LZWInputStream.initializeTables:126
>   at 
> org.apache.commons.compress.compressors.z.ZCompressorInputStream.<init>:54
>   at 
> org.apache.commons.compress.compressors.CompressorStreamFactory.createCompressorInputStream:237
>   at 
> org.apache.tika.parser.pkg.ZipContainerDetector.detectCompressorFormat:109
>   at org.apache.tika.parser.pkg.ZipContainerDetector.detect:95
>   at org.apache.tika.detect.CompositeDetector.detect:77
>   at org.apache.tika.parser.AutoDetectParser.parse:112
>   at org.apache.tika.parser.DelegatingParser.parse:72
>   at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded:102
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE:245
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts:197
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML:115
>   at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML:105
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:112
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87





[jira] [Updated] (TIKA-2206) RecordFormatException on a valid Excel file

2016-12-13 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2206:
-
Attachment: Budget_storyboard_V2_06282013.xls

> RecordFormatException on a valid Excel file
> ---
>
> Key: TIKA-2206
> URL: https://issues.apache.org/jira/browse/TIKA-2206
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Budget_storyboard_V2_06282013.xls
>
>
> The attached file, which opens fine in Excel, errors out in Tika:
> org.apache.poi.hssf.record.RecordFormatException for 
> 63773/<\\ai-storm\FScan\Scan_2016-12-11_11-14-13\Folders\70664525\Budget_storyboard_V2_06282013.xls>:
>  "Leftover 3 bytes in subrecord data [15, 00, 12, 00, 12, 00, 3E, 00, 11, 20, 
> 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 0C, 00, 14, 00, 00, 00, 00, 
> 00, 00, 00, 00, 00, 00, 00, 01, 00, 04, 00, 00, 00, 10, 00, 01, 00, 13, 00, 
> EE, 1F, 12, 00, 0B, 00, 00, 00, 00, 00, 3B, 00, 00, 00, 00, 02, 00, 00, 00, 
> 00, 00, 00, 03, 00, 00, 00, 18, 00, 00, 00, 00, 01, 00]"
> org.apache.poi.hssf.record.RecordFormatException: Leftover 3 bytes in 
> subrecord data [15, 00, 12, 00, 12, 00, 3E, 00, 11, 20, 00, 00, 00, 00, 00, 
> 00, 00, 00, 00, 00, 00, 00, 0C, 00, 14, 00, 00, 00, 00, 00, 00, 00, 00, 00, 
> 00, 00, 01, 00, 04, 00, 00, 00, 10, 00, 01, 00, 13, 00, EE, 1F, 12, 00, 0B, 
> 00, 00, 00, 00, 00, 3B, 00, 00, 00, 00, 02, 00, 00, 00, 00, 00, 00, 03, 00, 
> 00, 00, 18, 00, 00, 00, 00, 01, 00]
>   at org.apache.poi.hssf.record.ObjRecord.<init>:108
>   at sun.reflect.GeneratedConstructorAccessor14.newInstance:-1
>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
>   at java.lang.reflect.Constructor.newInstance:-1
>   at 
> org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create:84
>   at org.apache.poi.hssf.record.RecordFactory.createSingleRecord:345
>   at 
> org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord:307
>   at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord:273
>   at 
> org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:175
>   at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
>   at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:177
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Created] (TIKA-2206) RecordFormatException on a valid Excel file

2016-12-13 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2206:


 Summary: RecordFormatException on a valid Excel file
 Key: TIKA-2206
 URL: https://issues.apache.org/jira/browse/TIKA-2206
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


The attached file, which opens fine in Excel, errors out in Tika:

org.apache.poi.hssf.record.RecordFormatException for 
63773/<\\ai-storm\FScan\Scan_2016-12-11_11-14-13\Folders\70664525\Budget_storyboard_V2_06282013.xls>:
 "Leftover 3 bytes in subrecord data [15, 00, 12, 00, 12, 00, 3E, 00, 11, 20, 
00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 0C, 00, 14, 00, 00, 00, 00, 00, 
00, 00, 00, 00, 00, 00, 01, 00, 04, 00, 00, 00, 10, 00, 01, 00, 13, 00, EE, 1F, 
12, 00, 0B, 00, 00, 00, 00, 00, 3B, 00, 00, 00, 00, 02, 00, 00, 00, 00, 00, 00, 
03, 00, 00, 00, 18, 00, 00, 00, 00, 01, 00]"
org.apache.poi.hssf.record.RecordFormatException: Leftover 3 bytes in subrecord 
data [15, 00, 12, 00, 12, 00, 3E, 00, 11, 20, 00, 00, 00, 00, 00, 00, 00, 00, 
00, 00, 00, 00, 0C, 00, 14, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 00, 01, 00, 
04, 00, 00, 00, 10, 00, 01, 00, 13, 00, EE, 1F, 12, 00, 0B, 00, 00, 00, 00, 00, 
3B, 00, 00, 00, 00, 02, 00, 00, 00, 00, 00, 00, 03, 00, 00, 00, 18, 00, 00, 00, 
00, 01, 00]
at org.apache.poi.hssf.record.ObjRecord.<init>:108
at sun.reflect.GeneratedConstructorAccessor14.newInstance:-1
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
at java.lang.reflect.Constructor.newInstance:-1
at 
org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create:84
at org.apache.poi.hssf.record.RecordFactory.createSingleRecord:345
at 
org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord:307
at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord:273
at 
org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:175
at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
at 
org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
at org.apache.tika.parser.microsoft.OfficeParser.parse:177
at org.apache.tika.parser.microsoft.OfficeParser.parse:130






[jira] [Updated] (TIKA-2205) IllegalArgumentException on a valid Excel file

2016-12-13 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2205:
-
Attachment: SAT19-11-25-09_Selected Dates.xls

> IllegalArgumentException on a valid Excel file
> --
>
> Key: TIKA-2205
> URL: https://issues.apache.org/jira/browse/TIKA-2205
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: SAT19-11-25-09_Selected Dates.xls
>
>
> The attached file, which opens in Excel, errors out in Tika:
> java.lang.IllegalArgumentException: Cannot format given Object as a Number
>   at java.text.DecimalFormat.format:-1
>   at java.text.Format.format:-1
>   at org.apache.poi.ss.usermodel.DataFormatter.performDateFormatting:736
>   at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:804
>   at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:785
>   at 
> org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell:143
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.formatNumberDateCell:633
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord:432
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord:336
>   at 
> org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord:92
>   at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord:109
>   at 
> org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:179
>   at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
>   at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:177
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130
>   at gov.nih.niaid.fscanner.Extract.ExtractContents:69





[jira] [Created] (TIKA-2205) IllegalArgumentException on a valid Excel file

2016-12-13 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2205:


 Summary: IllegalArgumentException on a valid Excel file
 Key: TIKA-2205
 URL: https://issues.apache.org/jira/browse/TIKA-2205
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


The attached file, which opens in Excel, errors out in Tika:

java.lang.IllegalArgumentException: Cannot format given Object as a Number
at java.text.DecimalFormat.format:-1
at java.text.Format.format:-1
at org.apache.poi.ss.usermodel.DataFormatter.performDateFormatting:736
at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:804
at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:785
at 
org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell:143
at 
org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.formatNumberDateCell:633
at 
org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord:432
at 
org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord:336
at 
org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord:92
at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord:109
at 
org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:179
at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
at 
org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
at org.apache.tika.parser.microsoft.OfficeParser.parse:177
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
at gov.nih.niaid.fscanner.Extract.ExtractContents:69
org.apache.tika.exception.TikaException for 
63269/<\\ai-storm\FScan\Scan_2016-12-11_11-14-13\Folders\51541330\engelAPBD 
copy.pptx>: "Error creating OOXML extractor"
org.apache.tika.exception.TikaException: Error creating OOXML extractor
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:120
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87
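
The IllegalArgumentException in this report comes out of java.text.DecimalFormat.format, reached through Format.format(Object). The JDK throws exactly this message when the object handed to a DecimalFormat is not a java.lang.Number. The sketch below reproduces that with the standard library alone and shows one defensive fallback. The names NumberFormatGuard, rejects and formatOrFallback are hypothetical; what value POI actually passed for the attached spreadsheet is not known from the trace, so this illustrates the mechanism only.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.Format;
import java.util.Date;
import java.util.Locale;

final class NumberFormatGuard {

    /** Returns true if the format refuses the value with an IllegalArgumentException. */
    static boolean rejects(Format format, Object value) {
        try {
            format.format(value);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    /** Formats the value, falling back to its plain string form if the format refuses it. */
    static String formatOrFallback(Format format, Object value) {
        try {
            return format.format(value);
        } catch (IllegalArgumentException e) {
            return String.valueOf(value);
        }
    }

    public static void main(String[] args) {
        // Locale.ROOT symbols keep the decimal separator independent of the machine's locale.
        Format numeric = new DecimalFormat("0.00", DecimalFormatSymbols.getInstance(Locale.ROOT));
        System.out.println(numeric.format(1.5));                 // a Number is accepted
        System.out.println(rejects(numeric, new Date()));        // a Date is not a Number
        System.out.println(formatOrFallback(numeric, "n/a"));    // falls back to the raw text
    }
}
```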





[jira] [Updated] (TIKA-2205) IllegalArgumentException on a valid Excel file

2016-12-13 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2205:
-
Description: 
The attached file, which opens in Excel, errors out in Tika:

java.lang.IllegalArgumentException: Cannot format given Object as a Number
at java.text.DecimalFormat.format:-1
at java.text.Format.format:-1
at org.apache.poi.ss.usermodel.DataFormatter.performDateFormatting:736
at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:804
at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:785
at 
org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell:143
at 
org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.formatNumberDateCell:633
at 
org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord:432
at 
org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord:336
at 
org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord:92
at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord:109
at 
org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:179
at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
at 
org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
at org.apache.tika.parser.microsoft.OfficeParser.parse:177
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
at gov.nih.niaid.fscanner.Extract.ExtractContents:69


  was:
The attached file, which opens in Excel, errors out in Tika:

java.lang.IllegalArgumentException: Cannot format given Object as a Number
at java.text.DecimalFormat.format:-1
at java.text.Format.format:-1
at org.apache.poi.ss.usermodel.DataFormatter.performDateFormatting:736
at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:804
at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:785
at 
org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell:143
at 
org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.formatNumberDateCell:633
at 
org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord:432
at 
org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord:336
at 
org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord:92
at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord:109
at 
org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:179
at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
at 
org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
at org.apache.tika.parser.microsoft.OfficeParser.parse:177
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
at gov.nih.niaid.fscanner.Extract.ExtractContents:69
org.apache.tika.exception.TikaException for 
63269/<\\ai-storm\FScan\Scan_2016-12-11_11-14-13\Folders\51541330\engelAPBD 
copy.pptx>: "Error creating OOXML extractor"
org.apache.tika.exception.TikaException: Error creating OOXML extractor
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:120
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87


> IllegalArgumentException on a valid Excel file
> --
>
> Key: TIKA-2205
> URL: https://issues.apache.org/jira/browse/TIKA-2205
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: SAT19-11-25-09_Selected Dates.xls
>
>
> The attached file, which opens in Excel, errors out in Tika:
> java.lang.IllegalArgumentException: Cannot format given Object as a Number
>   at java.text.DecimalFormat.format:-1
>   at java.text.Format.format:-1
>   at org.apache.poi.ss.usermodel.DataFormatter.performDateFormatting:736
>   at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:804
>   at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:785
>   at 
> org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell:143
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.formatNumberDateCell:633
>   at 
> 

[jira] [Created] (TIKA-2204) IndexOutOfBoundsException on a valid Powerpoint file

2016-12-13 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2204:


 Summary: IndexOutOfBoundsException on a valid Powerpoint file
 Key: TIKA-2204
 URL: https://issues.apache.org/jira/browse/TIKA-2204
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
 Attachments: 061511.pptx

The attached file, which opens in PowerPoint, errors out in Tika:

java.lang.IndexOutOfBoundsException: Block 733 not found
at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt:486
at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readCoreContents:449
at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>:335
at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>:87
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE:226
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts:197
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML:115
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:112
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87





[jira] [Updated] (TIKA-2204) IndexOutOfBoundsException on a valid Powerpoint file

2016-12-13 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2204:
-
Attachment: 061511.pptx

> IndexOutOfBoundsException on a valid Powerpoint file
> 
>
> Key: TIKA-2204
> URL: https://issues.apache.org/jira/browse/TIKA-2204
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: 061511.pptx
>
>
> The attached file, which opens in PowerPoint, errors out in Tika:
> java.lang.IndexOutOfBoundsException: Block 733 not found
>   at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt:486
>   at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readCoreContents:449
>   at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>:335
>   at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>:87
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE:226
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts:197
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML:115
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:112
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87





[jira] [Created] (TIKA-2203) InvalidOperationException on a valid Word file

2016-12-13 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2203:


 Summary: InvalidOperationException on a valid Word file
 Key: TIKA-2203
 URL: https://issues.apache.org/jira/browse/TIKA-2203
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
 Attachments: OPCCompliance_DerivedPartNameFAIL.docx

The attached Word file, which opens in Word, errors out in Tika:

org.apache.tika.exception.TikaException: Error creating OOXML extractor
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:123
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87
at gov.nih.niaid.fscanner.Extract.ExtractContents:69
Caused by: org.apache.poi.openxml4j.exceptions.InvalidFormatException: You 
can't add a part with a part name derived from another part ! [M1.11]
at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl:338
at org.apache.poi.openxml4j.opc.OPCPackage.getParts:774
at org.apache.poi.openxml4j.opc.OPCPackage.open:268
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:69
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87
Caused by: org.apache.poi.openxml4j.exceptions.InvalidOperationException: You 
can't add a part with a part name derived from another part ! [M1.11]
at org.apache.poi.openxml4j.opc.PackagePartCollection.put:66
at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl:336
at org.apache.poi.openxml4j.opc.OPCPackage.getParts:774
at org.apache.poi.openxml4j.opc.OPCPackage.open:268
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:69
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87
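
The message tagged [M1.11] refers to the Open Packaging Conventions rule that a package must not contain a part whose name is derived from another part's name by appending segments (for example /a/b.xml alongside /a/b.xml/c.xml). A minimal sketch of such a check follows. The names PartNameCheck, isDerivedFrom and conflicts are hypothetical and this is not POI's implementation; the case-insensitive comparison is an assumption based on how OPC defines part name equivalence.

```java
import java.util.Locale;

final class PartNameCheck {

    /** True if `candidate` is `existing` plus one or more extra path segments. */
    static boolean isDerivedFrom(String candidate, String existing) {
        // Assumption: part names are compared case-insensitively.
        String c = candidate.toLowerCase(Locale.ROOT);
        String e = existing.toLowerCase(Locale.ROOT);
        // Require the existing name, a slash, and at least one further character.
        return c.length() > e.length() + 1 && c.startsWith(e + "/");
    }

    /** True if either name is derived from the other, so the two cannot coexist in one package. */
    static boolean conflicts(String a, String b) {
        return isDerivedFrom(a, b) || isDerivedFrom(b, a);
    }

    public static void main(String[] args) {
        System.out.println(conflicts("/word/document.xml", "/word/document.xml/extra.xml"));
        System.out.println(conflicts("/word/document.xml", "/word/document2.xml"));
    }
}
```

Word evidently tolerates the attached file, so a lenient reader might skip the offending part instead of rejecting the whole package; the report itself does not say which behavior is intended.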





[jira] [Updated] (TIKA-2203) InvalidOperationException on a valid Word file

2016-12-13 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2203:
-
Attachment: OPCCompliance_DerivedPartNameFAIL.docx

> InvalidOperationException on a valid Word file
> --
>
> Key: TIKA-2203
> URL: https://issues.apache.org/jira/browse/TIKA-2203
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: OPCCompliance_DerivedPartNameFAIL.docx
>
>
> The attached Word file, which opens in Word, errors out in Tika:
> org.apache.tika.exception.TikaException: Error creating OOXML extractor
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:123
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87
>   at gov.nih.niaid.fscanner.Extract.ExtractContents:69
> Caused by: org.apache.poi.openxml4j.exceptions.InvalidFormatException: You 
> can't add a part with a part name derived from another part ! [M1.11]
>   at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl:338
>   at org.apache.poi.openxml4j.opc.OPCPackage.getParts:774
>   at org.apache.poi.openxml4j.opc.OPCPackage.open:268
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:69
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87
> Caused by: org.apache.poi.openxml4j.exceptions.InvalidOperationException: You 
> can't add a part with a part name derived from another part ! [M1.11]
>   at org.apache.poi.openxml4j.opc.PackagePartCollection.put:66
>   at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl:336
>   at org.apache.poi.openxml4j.opc.OPCPackage.getParts:774
>   at org.apache.poi.openxml4j.opc.OPCPackage.open:268
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:69
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87





[jira] [Updated] (TIKA-2202) StringIndexOutOfBoundsException on a valid Word document

2016-12-13 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2202:
-
Attachment: 052306.ITN032AD_Lack_Protocolv0.45_22May06.doc

> StringIndexOutOfBoundsException on a valid Word document
> 
>
> Key: TIKA-2202
> URL: https://issues.apache.org/jira/browse/TIKA-2202
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: 052306.ITN032AD_Lack_Protocolv0.45_22May06.doc
>
>
> The attached document, which opens in Word, errors out in Tika:
> java.lang.StringIndexOutOfBoundsException: String index out of range: 0
>   at java.lang.String.charAt:-1
>   at 
> org.apache.tika.parser.microsoft.ListManager.convertToNewNumberText:152
>   at org.apache.tika.parser.microsoft.ListManager.buildTuple:111
>   at org.apache.tika.parser.microsoft.ListManager.getFormattedNumber:86
>   at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph:298
>   at org.apache.tika.parser.microsoft.WordExtractor.parse:179
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:169
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Created] (TIKA-2202) StringIndexOutOfBoundsException on a valid Word document

2016-12-13 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2202:


 Summary: StringIndexOutOfBoundsException on a valid Word document
 Key: TIKA-2202
 URL: https://issues.apache.org/jira/browse/TIKA-2202
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
 Attachments: 052306.ITN032AD_Lack_Protocolv0.45_22May06.doc

The attached document, which opens in Word, errors out in Tika:

java.lang.StringIndexOutOfBoundsException: String index out of range: 0
at java.lang.String.charAt:-1
at 
org.apache.tika.parser.microsoft.ListManager.convertToNewNumberText:152
at org.apache.tika.parser.microsoft.ListManager.buildTuple:111
at org.apache.tika.parser.microsoft.ListManager.getFormattedNumber:86
at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph:298
at org.apache.tika.parser.microsoft.WordExtractor.parse:179
at org.apache.tika.parser.microsoft.OfficeParser.parse:169
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
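
"String index out of range: 0" from String.charAt means charAt(0) was called on an empty string. The sketch below shows the failure and a guarded alternative using only the standard library. FirstChar, charAtZeroFails and firstCharOr are hypothetical names, not code from ListManager; what the list-number text looked like in the attached document is not known from the trace.

```java
final class FirstChar {

    /** Returns true if text.charAt(0) throws, which happens when the string is empty. */
    static boolean charAtZeroFails(String text) {
        try {
            text.charAt(0);
            return false;
        } catch (StringIndexOutOfBoundsException e) {
            return true;
        }
    }

    /** Returns the first character, or the fallback when the text is null or empty. */
    static char firstCharOr(String text, char fallback) {
        return (text == null || text.isEmpty()) ? fallback : text.charAt(0);
    }

    public static void main(String[] args) {
        System.out.println(charAtZeroFails(""));      // empty string has no index 0
        System.out.println(firstCharOr("", '?'));     // guarded access returns the fallback
        System.out.println(firstCharOr("1.", '?'));   // non-empty text returns its first character
    }
}
```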





[jira] [Updated] (TIKA-2201) OutOfMemoryError on a reasonably sized document

2016-12-13 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2201:
-
Description: 
The following document, which is not particularly big, causes an OOM in Tika 
parser:

https://dl.dropboxusercontent.com/u/92341073/Certificates-9-20-2013.pptx

Java memory limit is 4GB.

  was:The attached document, which is not particularly big, causes an OOM in 
Tika parser.


> OutOfMemoryError on a reasonably sized document
> ---
>
> Key: TIKA-2201
> URL: https://issues.apache.org/jira/browse/TIKA-2201
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
>
> The following document, which is not particularly big, causes an OOM in Tika 
> parser:
> https://dl.dropboxusercontent.com/u/92341073/Certificates-9-20-2013.pptx
> Java memory limit is 4GB.





[jira] [Created] (TIKA-2201) OutOfMemoryError on a reasonably sized document

2016-12-13 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2201:


 Summary: OutOfMemoryError on a reasonably sized document
 Key: TIKA-2201
 URL: https://issues.apache.org/jira/browse/TIKA-2201
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


The attached document, which is not particularly big, causes an OOM in Tika 
parser.





[jira] [Updated] (TIKA-2197) TikaException from invalid URL in an Excel document

2016-12-13 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2197:
-
Attachment: NCI WH presentation JAC 3-23-15_234pm.pptx

> TikaException from invalid URL in an Excel document
> ---
>
> Key: TIKA-2197
> URL: https://issues.apache.org/jira/browse/TIKA-2197
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: NCI WH presentation JAC 3-23-15_234pm.pptx, 
> Neut_paratope_updated_0813_naming_formattable.xlsx
>
>
> The attached document, which opens fine in Excel (if slowly), causes the 
> following error in the Tika parser:
> org.apache.tika.exception.TikaException: Error creating OOXML extractor
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:120
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87
> Caused by: java.lang.IllegalArgumentException: targetUri invalid - 
> http://invalid.uri
>   at org.apache.poi.openxml4j.opc.PackagingURIHelper.resolvePartUri:427
>   at org.apache.poi.openxml4j.opc.PackageRelationship.getTargetURI:206
>   at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.addDrawingHyperLinks:182
>   at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML:134
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML:112
>   at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML:105
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:112
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87
>   at gov.nih.niaid.fscanner.Extract.ExtractContents:69





[jira] [Created] (TIKA-2200) XML schema mismatch error on a valid Word document

2016-12-13 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2200:


 Summary: XML schema mismatch error on a valid Word document
 Key: TIKA-2200
 URL: https://issues.apache.org/jira/browse/TIKA-2200
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
 Attachments: MK2048_FROM_ISENTRIS.docx

The attached document, which opens in Word, errors out in Tika:

org.apache.poi.POIXMLException: org.apache.xmlbeans.XmlException: error: The 
document is not a 
document@http://schemas.openxmlformats.org/wordprocessingml/2006/main: document 
element local name mismatch expected document got wordDocument
at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead:241
at org.apache.poi.POIXMLDocument.load:190
at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>:124
at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>:58
at org.apache.poi.extractor.ExtractorFactory.createExtractor:232
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:86
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87
Caused by: org.apache.xmlbeans.XmlException: error: The document is not a 
document@http://schemas.openxmlformats.org/wordprocessingml/2006/main: document 
element local name mismatch expected document got wordDocument
at org.apache.xmlbeans.impl.store.Locale.verifyDocumentType:459
at org.apache.xmlbeans.impl.store.Locale.autoTypeDocument:364
at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject:1391
at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject:1370
at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse:370
at org.apache.poi.POIXMLTypeLoader.parse:116
at 
org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse:-1
at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead:164
at org.apache.poi.POIXMLDocument.load:190
at org.apache.poi.xwpf.usermodel.XWPFDocument.:124
at org.apache.poi.xwpf.extractor.XWPFWordExtractor.:58
at org.apache.poi.extractor.ExtractorFactory.createExtractor:232
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:86
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87





[jira] [Updated] (TIKA-2200) XML schema mismatch error on a valid Word document

2016-12-13 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2200:
-
Attachment: MK2048_FROM_ISENTRIS.docx

> XML schema mismatch error on a valid Word document
> --
>
> Key: TIKA-2200
> URL: https://issues.apache.org/jira/browse/TIKA-2200
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: MK2048_FROM_ISENTRIS.docx
>
>
> The attached document, which opens in Word, errors out in Tika:
> org.apache.poi.POIXMLException: org.apache.xmlbeans.XmlException: error: The 
> document is not a 
> document@http://schemas.openxmlformats.org/wordprocessingml/2006/main: 
> document element local name mismatch expected document got wordDocument
>   at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead:241
>   at org.apache.poi.POIXMLDocument.load:190
>   at org.apache.poi.xwpf.usermodel.XWPFDocument.:124
>   at org.apache.poi.xwpf.extractor.XWPFWordExtractor.:58
>   at org.apache.poi.extractor.ExtractorFactory.createExtractor:232
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:86
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87
> Caused by: org.apache.xmlbeans.XmlException: error: The document is not a 
> document@http://schemas.openxmlformats.org/wordprocessingml/2006/main: 
> document element local name mismatch expected document got wordDocument
>   at org.apache.xmlbeans.impl.store.Locale.verifyDocumentType:459
>   at org.apache.xmlbeans.impl.store.Locale.autoTypeDocument:364
>   at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject:1391
>   at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject:1370
>   at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse:370
>   at org.apache.poi.POIXMLTypeLoader.parse:116
>   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse:-1
>   at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead:164
>   at org.apache.poi.POIXMLDocument.load:190
>   at org.apache.poi.xwpf.usermodel.XWPFDocument.:124
>   at org.apache.poi.xwpf.extractor.XWPFWordExtractor.:58
>   at org.apache.poi.extractor.ExtractorFactory.createExtractor:232
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:86
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87





[jira] [Created] (TIKA-2199) RecordFormatException on a valid Excel file

2016-12-13 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2199:


 Summary: RecordFormatException on a valid Excel file
 Key: TIKA-2199
 URL: https://issues.apache.org/jira/browse/TIKA-2199
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
 Attachments: CDC survcost.xls

The attached file, which opens in Excel, causes an error in the Tika parser:

org.apache.poi.util.RecordFormatException: Unable to construct record instance
at 
org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create:98
at org.apache.poi.hssf.record.RecordFactory.createSingleRecord:345
at 
org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord:307
at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord:273
at 
org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:175
at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
at 
org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
at org.apache.tika.parser.microsoft.OfficeParser.parse:177
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.IllegalArgumentException: Start index must be less than 
end index.
at org.apache.poi.hssf.usermodel.HSSFRichTextString.applyFont:136
at org.apache.poi.hssf.record.TextObjectRecord.processFontRuns:155
at org.apache.poi.hssf.record.TextObjectRecord.:131
at sun.reflect.GeneratedConstructorAccessor19.newInstance:-1
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
at java.lang.reflect.Constructor.newInstance:-1
at 
org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create:84
at org.apache.poi.hssf.record.RecordFactory.createSingleRecord:345
at 
org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord:307
at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord:273
at 
org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:175
at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
at 
org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
at org.apache.tika.parser.microsoft.OfficeParser.parse:177
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
at gov.nih.niaid.fscanner.Extract.ExtractContents:69
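
The trace above nests an IllegalArgumentException inside a RecordFormatException. A caller that only wants to log the innermost failure could walk the cause chain. The following is a minimal sketch using only the JDK; the class name RootCauseDemo is hypothetical, and a plain RuntimeException stands in for POI's RecordFormatException so that the example has no external dependencies.

```java
public class RootCauseDemo {
    /** Follows getCause() links until the innermost throwable is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        // Stop on a null cause, or on a self-referential cause, to avoid looping forever.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-ins for the two exceptions in the reported trace (assumption: simplified chain).
        Throwable inner = new IllegalArgumentException("Start index must be less than end index.");
        Throwable outer = new RuntimeException("Unable to construct record instance", inner);
        System.out.println(rootCause(outer));
    }
}
```

Logging the root cause alongside the outer exception keeps the short, specific message visible even when the wrapper message is generic.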





[jira] [Updated] (TIKA-2199) RecordFormatException on a valid Excel file

2016-12-13 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2199:
-
Attachment: CDC survcost.xls

> RecordFormatException on a valid Excel file
> ---
>
> Key: TIKA-2199
> URL: https://issues.apache.org/jira/browse/TIKA-2199
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: CDC survcost.xls
>
>
> The attached file, which opens in Excel, causes an error in the Tika parser:
> org.apache.poi.util.RecordFormatException: Unable to construct record instance
>   at 
> org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create:98
>   at org.apache.poi.hssf.record.RecordFactory.createSingleRecord:345
>   at 
> org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord:307
>   at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord:273
>   at 
> org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:175
>   at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
>   at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:177
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130
> Caused by: java.lang.IllegalArgumentException: Start index must be less than 
> end index.
>   at org.apache.poi.hssf.usermodel.HSSFRichTextString.applyFont:136
>   at org.apache.poi.hssf.record.TextObjectRecord.processFontRuns:155
>   at org.apache.poi.hssf.record.TextObjectRecord.:131
>   at sun.reflect.GeneratedConstructorAccessor19.newInstance:-1
>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
>   at java.lang.reflect.Constructor.newInstance:-1
>   at 
> org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create:84
>   at org.apache.poi.hssf.record.RecordFactory.createSingleRecord:345
>   at 
> org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord:307
>   at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord:273
>   at 
> org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:175
>   at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
>   at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:177
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130
>   at gov.nih.niaid.fscanner.Extract.ExtractContents:69





[jira] [Updated] (TIKA-2198) NullPointerException on a valid Word file

2016-12-13 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2198:
-
Attachment: CIPRA SA concept project 2 rev JM.doc

> NullPointerException on a valid Word file
> -
>
> Key: TIKA-2198
> URL: https://issues.apache.org/jira/browse/TIKA-2198
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: CIPRA SA concept project 2 rev JM.doc
>
>
> On the attached file, which opens fine in Word, the Tika parser throws the 
> following error:
> java.lang.NullPointerException: 
>   at org.apache.poi.hwpf.model.ListTables.getLevel:141
>   at org.apache.poi.hwpf.usermodel.Paragraph.newParagraph:125
>   at org.apache.poi.hwpf.usermodel.Range.getParagraph:766
>   at org.apache.tika.parser.microsoft.WordExtractor.parse:178
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:169
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Created] (TIKA-2198) NullPointerException on a valid Word file

2016-12-13 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2198:


 Summary: NullPointerException on a valid Word file
 Key: TIKA-2198
 URL: https://issues.apache.org/jira/browse/TIKA-2198
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


On the attached file, which opens fine in Word, the Tika parser throws the 
following error:

java.lang.NullPointerException: 
at org.apache.poi.hwpf.model.ListTables.getLevel:141
at org.apache.poi.hwpf.usermodel.Paragraph.newParagraph:125
at org.apache.poi.hwpf.usermodel.Range.getParagraph:766
at org.apache.tika.parser.microsoft.WordExtractor.parse:178
at org.apache.tika.parser.microsoft.OfficeParser.parse:169
at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Created] (TIKA-2197) TikaException from invalid URL in an Excel document

2016-12-13 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2197:


 Summary: TikaException from invalid URL in an Excel document
 Key: TIKA-2197
 URL: https://issues.apache.org/jira/browse/TIKA-2197
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
 Attachments: Neut_paratope_updated_0813_naming_formattable.xlsx

The attached document, which opens fine in Excel (if slowly), causes the 
following error in the Tika parser:

org.apache.tika.exception.TikaException: Error creating OOXML extractor
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:120
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87
Caused by: java.lang.IllegalArgumentException: targetUri invalid - 
http://invalid.uri
at org.apache.poi.openxml4j.opc.PackagingURIHelper.resolvePartUri:427
at org.apache.poi.openxml4j.opc.PackageRelationship.getTargetURI:206
at 
org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.addDrawingHyperLinks:182
at 
org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML:134
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML:112
at 
org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML:105
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:112
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87
at gov.nih.niaid.fscanner.Extract.ExtractContents:69





[jira] [Updated] (TIKA-2197) TikaException from invalid URL in an Excel document

2016-12-13 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2197:
-
Attachment: Neut_paratope_updated_0813_naming_formattable.xlsx

> TikaException from invalid URL in an Excel document
> ---
>
> Key: TIKA-2197
> URL: https://issues.apache.org/jira/browse/TIKA-2197
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Neut_paratope_updated_0813_naming_formattable.xlsx
>
>
> The attached document, which opens fine in Excel (if slowly), causes the 
> following error in the Tika parser:
> org.apache.tika.exception.TikaException: Error creating OOXML extractor
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:120
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87
> Caused by: java.lang.IllegalArgumentException: targetUri invalid - 
> http://invalid.uri
>   at org.apache.poi.openxml4j.opc.PackagingURIHelper.resolvePartUri:427
>   at org.apache.poi.openxml4j.opc.PackageRelationship.getTargetURI:206
>   at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.addDrawingHyperLinks:182
>   at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML:134
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML:112
>   at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML:105
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:112
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87
>   at gov.nih.niaid.fscanner.Extract.ExtractContents:69





[jira] [Updated] (TIKA-2196) IllegalArgumentException on a valid Excel file

2016-12-13 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2196:
-
Attachment: 2007 Experiment watch.xls

> IllegalArgumentException on a valid Excel file
> --
>
> Key: TIKA-2196
> URL: https://issues.apache.org/jira/browse/TIKA-2196
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: 2007 Experiment watch.xls
>
>
> On the attached Excel file, which opens fine in Excel, Tika throws the 
> following error:
> java.lang.IllegalArgumentException: Cannot format given Object as a Number
>   at java.text.DecimalFormat.format:-1
>   at org.apache.poi.ss.usermodel.ExcelGeneralNumberFormat.format:67
>   at java.text.Format.format:-1
>   at org.apache.poi.ss.usermodel.DataFormatter.performDateFormatting:736
>   at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:804
>   at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:785
>   at 
> org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell:143
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.formatNumberDateCell:633
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord:405
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord:336
>   at 
> org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord:92
>   at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord:109
>   at 
> org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:179
>   at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
>   at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:177
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Created] (TIKA-2196) IllegalArgumentException on a valid Excel file

2016-12-13 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2196:


 Summary: IllegalArgumentException on a valid Excel file
 Key: TIKA-2196
 URL: https://issues.apache.org/jira/browse/TIKA-2196
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
 Attachments: 2007 Experiment watch.xls

On the attached Excel file, which opens fine in Excel, Tika throws the 
following error:

java.lang.IllegalArgumentException: Cannot format given Object as a Number
at java.text.DecimalFormat.format:-1
at org.apache.poi.ss.usermodel.ExcelGeneralNumberFormat.format:67
at java.text.Format.format:-1
at org.apache.poi.ss.usermodel.DataFormatter.performDateFormatting:736
at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:804
at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:785
at 
org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell:143
at 
org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.formatNumberDateCell:633
at 
org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord:405
at 
org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord:336
at 
org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord:92
at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord:109
at 
org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:179
at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
at 
org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
at org.apache.tika.parser.microsoft.OfficeParser.parse:177
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
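
The innermost frames show java.text.DecimalFormat.format being reached through java.text.Format.format, which accepts any Object. The JDK throws exactly this message when the argument is not a Number. The sketch below reproduces that JDK behaviour in isolation; the class name NonNumberFormatDemo is hypothetical, and the use of a java.util.Date as the offending value is an assumption suggested by the performDateFormatting frame, not something the report confirms.

```java
import java.text.DecimalFormat;
import java.util.Date;

public class NonNumberFormatDemo {
    /** Returns the exception message from formatting the value, or null if formatting succeeds. */
    static String formatFailure(Object value) {
        DecimalFormat format = new DecimalFormat("#.##");
        try {
            // With a static type of Object this resolves to Format.format(Object),
            // which delegates to DecimalFormat.format(Object, StringBuffer, FieldPosition).
            format.format(value);
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Assumption: a Date stands in for whatever non-Number value reached the formatter.
        System.out.println(formatFailure(new Date()));
        // A boxed Double is a Number, so no exception is raised.
        System.out.println(formatFailure(42.5));
    }
}
```

If this reading is right, the failure would depend on the type of object passed to the number format rather than on the cell's numeric value.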






[jira] [Updated] (TIKA-2185) NegativeArraySizeException on a valid Word file

2016-11-23 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2185:
-
Attachment: PatentW final.doc

> NegativeArraySizeException on a valid Word file
> ---
>
> Key: TIKA-2185
> URL: https://issues.apache.org/jira/browse/TIKA-2185
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: PatentW final.doc
>
>
> On the attached document, which opens fine with Word, the Tika parser throws 
> the following:
> java.lang.NegativeArraySizeException: 
>   at org.apache.poi.hwpf.model.StyleDescription.:122
>   at org.apache.poi.hwpf.model.StyleSheet.:107
>   at org.apache.poi.hwpf.HWPFDocument.:289
>   at org.apache.tika.parser.microsoft.WordExtractor.parse:151
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:169
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Created] (TIKA-2185) NegativeArraySizeException on a valid Word file

2016-11-23 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2185:


 Summary: NegativeArraySizeException on a valid Word file
 Key: TIKA-2185
 URL: https://issues.apache.org/jira/browse/TIKA-2185
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
 Attachments: PatentW final.doc

On the attached document, which opens fine with Word, the Tika parser throws 
the following:

java.lang.NegativeArraySizeException: 
at org.apache.poi.hwpf.model.StyleDescription.:122
at org.apache.poi.hwpf.model.StyleSheet.:107
at org.apache.poi.hwpf.HWPFDocument.:289
at org.apache.tika.parser.microsoft.WordExtractor.parse:151
at org.apache.tika.parser.microsoft.OfficeParser.parse:169
at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Created] (TIKA-2184) RecordFormatException on a valid Excel file

2016-11-23 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2184:


 Summary: RecordFormatException on a valid Excel file
 Key: TIKA-2184
 URL: https://issues.apache.org/jira/browse/TIKA-2184
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
 Attachments: HIVT Discrepancy Report- 3-29-04UCSF.xls

On the attached file, which opens fine with Excel, the Tika parser throws the 
following:

org.apache.poi.hssf.record.RecordFormatException: Unhandled Continue Record 
followining class org.apache.poi.hssf.record.TabIdRecord
at 
org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord:379
at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord:273
at 
org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:175
at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
at 
org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
at org.apache.tika.parser.microsoft.OfficeParser.parse:177
at org.apache.tika.parser.microsoft.OfficeParser.parse:130






[jira] [Updated] (TIKA-2184) RecordFormatException on a valid Excel file

2016-11-23 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2184:
-
Attachment: HIVT Discrepancy Report- 3-29-04UCSF.xls

> RecordFormatException on a valid Excel file
> ---
>
> Key: TIKA-2184
> URL: https://issues.apache.org/jira/browse/TIKA-2184
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: HIVT Discrepancy Report- 3-29-04UCSF.xls
>
>
> On the attached file, which opens fine with Excel, the Tika parser throws the 
> following:
> org.apache.poi.hssf.record.RecordFormatException: Unhandled Continue Record 
> followining class org.apache.poi.hssf.record.TabIdRecord
>   at 
> org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord:379
>   at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord:273
>   at 
> org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:175
>   at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
>   at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:177
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Updated] (TIKA-2153) TaggedIOException on a valid Powerpoint file

2016-11-23 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2153:
-
Description: 
On the following Powerpoint file, which opens fine with Powerpoint:

https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx

the Tika parser throws the following error:

org.apache.tika.io.TaggedIOException: invalid stored block lengths
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
at 
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
at 
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78)
... 13 more
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at 
org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
... 19 more




Could be similar to #2130.

EDIT: similar exception on the attached Jinwoo_032910.pptx
EDIT: similar exception on daids.ppt
EDIT: similar exception on "IAVI Team meeting FINAL.ppt", but see #2162
EDIT: "Marcia Lecture.PPT"

  was:
On the following Powerpoint file, which opens fine with Powerpoint:

https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx

the Tika parser throws the following error:

org.apache.tika.io.TaggedIOException: invalid stored block lengths
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
at 
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
at 
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
   

[jira] [Updated] (TIKA-2153) TaggedIOException on a valid Powerpoint file

2016-11-23 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2153:
-
Attachment: Marcia Lecture.PPT

> TaggedIOException on a valid Powerpoint file
> 
>
> Key: TIKA-2153
> URL: https://issues.apache.org/jira/browse/TIKA-2153
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: IAVI Team meeting FINAL.ppt, Jinwoo_032910.pptx, Marcia 
> Lecture.PPT, daids.ppt, tika_2153_unzipping.png
>
>
> On the following Powerpoint file, which opens fine with Powerpoint:
> https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx
> the Tika parser throws the following error:
> org.apache.tika.io.TaggedIOException: invalid stored block lengths
>   at 
> org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
>   at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
>   at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
>   at 
> org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
>   at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>   at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
>   at 
> org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78)
>   ... 13 more
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at 
> org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   ... 19 more
> Could be similar to #2130.
> EDIT: similar exception on the attached Jinwoo_032910.pptx
> EDIT: similar exception on daids.ppt
> EDIT: similar exception on "IAVI Team meeting FINAL.ppt", but see #2162



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-2153) TaggedIOException on a valid Powerpoint file

2016-11-23 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2153:
-
Description: 
On the following PowerPoint file, which opens fine with PowerPoint:

https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx

the Tika parser throws the following error:

org.apache.tika.io.TaggedIOException: invalid stored block lengths
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
at 
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
at 
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78)
... 13 more
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at 
org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
... 19 more




Could be similar to #2130.

EDIT: similar exception on the attached Jinwoo_032910.pptx
EDIT: similar exception on daids.ppt
EDIT: similar exception on "IAVI Team meeting FINAL.ppt", but see #2162

  was:
On the following PowerPoint file, which opens fine with PowerPoint:

https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx

the Tika parser throws the following error:

org.apache.tika.io.TaggedIOException: invalid stored block lengths
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
at 
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
at 
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
at 

[jira] [Updated] (TIKA-2153) TaggedIOException on a valid Powerpoint file

2016-11-23 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2153:
-
Attachment: IAVI Team meeting FINAL.ppt

> TaggedIOException on a valid Powerpoint file
> 
>
> Key: TIKA-2153
> URL: https://issues.apache.org/jira/browse/TIKA-2153
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: IAVI Team meeting FINAL.ppt, Jinwoo_032910.pptx, daids.ppt, tika_2153_unzipping.png
>
>
> On the following PowerPoint file, which opens fine with PowerPoint:
> https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx
> the Tika parser throws the following error:
> org.apache.tika.io.TaggedIOException: invalid stored block lengths
>   at 
> org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
>   at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
>   at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
>   at 
> org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
>   at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>   at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
>   at 
> org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78)
>   ... 13 more
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at 
> org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   ... 19 more
> Could be similar to #2130.
> EDIT: similar exception on the attached Jinwoo_032910.pptx
> EDIT: similar exception on daids.ppt
> EDIT: similar exception on "IAVI Team meeting FINAL.ppt", but see #2162
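
The innermost cause in the trace above is a java.util.zip.ZipException raised from InflaterInputStream.read. As a minimal sketch of how that particular message can arise, the following uses only java.util.zip; it is not Tika or POI code, it does not reproduce the attached files, and it does not establish why those files trigger the error. It relies on the DEFLATE rule that a "stored" (uncompressed) block carries a 16-bit length LEN followed by its one's complement NLEN, and that the inflater rejects the block when the two disagree. The class and method names (StoredBlockDemo, storedBlock, inflate) are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;
import java.util.zip.ZipException;

public class StoredBlockDemo {

    // Builds a raw DEFLATE stream consisting of one final "stored" block.
    // Layout: header byte (BFINAL=1, BTYPE=00), LEN (little-endian),
    // NLEN (one's complement of LEN), then the payload bytes.
    // With corruptNlen=true, NLEN is written as 0, which mismatches LEN
    // for any payload shorter than 65535 bytes.
    static byte[] storedBlock(byte[] payload, boolean corruptNlen) {
        int len = payload.length;
        int nlen = corruptNlen ? 0 : (~len & 0xFFFF);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0x01);
        out.write(len & 0xFF);
        out.write((len >> 8) & 0xFF);
        out.write(nlen & 0xFF);
        out.write((nlen >> 8) & 0xFF);
        out.write(payload, 0, payload.length);
        return out.toByteArray();
    }

    // Inflates a raw DEFLATE stream (no zlib header, as inside a ZIP entry)
    // and returns the result as ASCII text.
    static String inflate(byte[] raw) throws IOException {
        Inflater inflater = new Inflater(true); // true = raw DEFLATE
        try (InputStream in =
                 new InflaterInputStream(new ByteArrayInputStream(raw), inflater)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[256];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.US_ASCII);
        } finally {
            inflater.end(); // a caller-supplied Inflater is not ended by close()
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(inflate(storedBlock(payload, false)));
        try {
            inflate(storedBlock(payload, true));
        } catch (ZipException e) {
            System.out.println("ZipException: " + e.getMessage());
        }
    }
}
```

The well-formed block should round-trip, while the block with the mismatched NLEN should fail inside InflaterInputStream.read with a ZipException, the same exception type that sits at the bottom of the trace above.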





[jira] [Updated] (TIKA-2153) TaggedIOException on a valid Powerpoint file

2016-11-23 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2153:
-
Description: 
On the following PowerPoint file, which opens fine with PowerPoint:

https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx

the Tika parser throws the following error:

org.apache.tika.io.TaggedIOException: invalid stored block lengths
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
at 
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
at 
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78)
... 13 more
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at 
org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
... 19 more




Could be similar to #2130.

EDIT: similar exception on the attached Jinwoo_032910.pptx
EDIT: similar exception on daids.pptx

  was:
On the following PowerPoint file, which opens fine with PowerPoint:

https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx

the Tika parser throws the following error:

org.apache.tika.io.TaggedIOException: invalid stored block lengths
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
at 
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
at 
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
at 

[jira] [Updated] (TIKA-2153) TaggedIOException on a valid Powerpoint file

2016-11-23 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2153:
-
Attachment: daids.ppt

> TaggedIOException on a valid Powerpoint file
> 
>
> Key: TIKA-2153
> URL: https://issues.apache.org/jira/browse/TIKA-2153
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Jinwoo_032910.pptx, daids.ppt, tika_2153_unzipping.png
>
>
> On the following PowerPoint file, which opens fine with PowerPoint:
> https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx
> the Tika parser throws the following error:
> org.apache.tika.io.TaggedIOException: invalid stored block lengths
>   at 
> org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
>   at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
>   at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
>   at 
> org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
>   at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>   at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
>   at 
> org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78)
>   ... 13 more
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at 
> org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   ... 19 more
> Could be similar to #2130.
> EDIT: similar exception on the attached Jinwoo_032910.pptx
> EDIT: similar exception on daids.pptx





[jira] [Updated] (TIKA-2153) TaggedIOException on a valid Powerpoint file

2016-11-23 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2153:
-
Description: 
On the following PowerPoint file, which opens fine with PowerPoint:

https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx

the Tika parser throws the following error:

org.apache.tika.io.TaggedIOException: invalid stored block lengths
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
at 
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
at 
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78)
... 13 more
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at 
org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
... 19 more




Could be similar to #2130.

EDIT: similar exception on the attached Jinwoo_032910.pptx
EDIT: similar exception on daids.ppt

  was:
On the following PowerPoint file, which opens fine with PowerPoint:

https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx

the Tika parser throws the following error:

org.apache.tika.io.TaggedIOException: invalid stored block lengths
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
at 
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
at 
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
at 

[jira] [Updated] (TIKA-2153) TaggedIOException on a valid Powerpoint file

2016-11-22 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2153:
-
Attachment: Jinwoo_032910.pptx

> TaggedIOException on a valid Powerpoint file
> 
>
> Key: TIKA-2153
> URL: https://issues.apache.org/jira/browse/TIKA-2153
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Jinwoo_032910.pptx, tika_2153_unzipping.png
>
>
> On the following PowerPoint file, which opens fine with PowerPoint:
> https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx
> the Tika parser throws the following error:
> org.apache.tika.io.TaggedIOException: invalid stored block lengths
>   at 
> org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
>   at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
>   at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
>   at 
> org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
>   at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>   at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
>   at 
> org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78)
>   ... 13 more
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at 
> org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   ... 19 more
> Could be similar to #2130.
> EDIT: similar exception on the attached Jinwoo_032910.pptx





[jira] [Updated] (TIKA-2153) TaggedIOException on a valid Powerpoint file

2016-11-22 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2153:
-
Description: 
On the following PowerPoint file, which opens fine with PowerPoint:

https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx

the Tika parser throws the following error:

org.apache.tika.io.TaggedIOException: invalid stored block lengths
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
at 
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
at 
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78)
... 13 more
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at 
org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
... 19 more




Could be similar to #2130.

EDIT: similar exception on the attached Jinwoo_032910.pptx

  was:
On the following PowerPoint file, which opens fine with PowerPoint:

https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx

the Tika parser throws the following error:

org.apache.tika.io.TaggedIOException: invalid stored block lengths
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
at 
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
at 
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
at 

[jira] [Updated] (TIKA-2161) EOFException on a valid Powerpoint file

2016-11-22 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2161:
-
Description: 
On the attached PowerPoint file, which opens fine with PowerPoint, the Tika
parser throws the following error:

org.apache.tika.io.TaggedIOException: Unexpected end of ZLIB input stream
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at java.nio.file.Files.copy(Files.java:2908)
at java.nio.file.Files.copy(Files.java:3027)
at org.apache.tika.io.TikaInputStream.getPath(TikaInputStream.java:587)
at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:615)
at 
org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:377)
at 
org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:443)
at 
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
at 
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
at 
org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:140)
at 
org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:116)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources(HSLFExtractor.java:368)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:138)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.io.EOFException: Unexpected end of ZLIB input stream
at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
at 
org.apache.poi.util.BoundedInputStream.read(BoundedInputStream.java:121)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
... 22 more

EDIT: Tika 1.14 throws EOFException

  was:
On the attached Powerpoint file, which opens fine with Powerpoint, the Tika 
parser throws the following error:

org.apache.tika.io.TaggedIOException: Unexpected end of ZLIB input stream
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at java.nio.file.Files.copy(Files.java:2908)
at java.nio.file.Files.copy(Files.java:3027)
at org.apache.tika.io.TikaInputStream.getPath(TikaInputStream.java:587)
at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:615)
at 
org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:377)
at 
org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:443)
at 
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
at 
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
at 
org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:140)
at 
org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:116)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources(HSLFExtractor.java:368)
at 

[jira] [Updated] (TIKA-2161) EOFException on a valid Powerpoint file

2016-11-22 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2161:
-
Summary: EOFException on a valid Powerpoint file  (was: TaggedIOException 
from EOFException on a valid Powerpoint file)

> EOFException on a valid Powerpoint file
> ---
>
> Key: TIKA-2161
> URL: https://issues.apache.org/jira/browse/TIKA-2161
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Erik-LymeChipBranchSeminar.ppt
>
>
> On the attached Powerpoint file, which opens fine with Powerpoint, the Tika 
> parser throws the following error:
> org.apache.tika.io.TaggedIOException: Unexpected end of ZLIB input stream
>   at 
> org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at java.nio.file.Files.copy(Files.java:2908)
>   at java.nio.file.Files.copy(Files.java:3027)
>   at org.apache.tika.io.TikaInputStream.getPath(TikaInputStream.java:587)
>   at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:615)
>   at 
> org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:377)
>   at 
> org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:443)
>   at 
> org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
>   at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>   at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
>   at 
> org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:140)
>   at 
> org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:116)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources(HSLFExtractor.java:368)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:138)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.io.EOFException: Unexpected end of ZLIB input stream
>   at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
>   at 
> org.apache.poi.util.BoundedInputStream.read(BoundedInputStream.java:121)
>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   ... 22 more
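
For context on the innermost frames above: InflaterInputStream.fill raises this EOFException when the compressed input runs out before the zlib stream is complete. A minimal JDK-only sketch of that condition follows. It assumes a stream cut short on purpose; it does not use Tika or POI and does not reproduce the attached file. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical demo class, not part of Tika or POI.
final class TruncatedZlibDemo {

    // Compress the payload into a complete zlib stream.
    static byte[] deflate(byte[] payload) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(sink)) {
            out.write(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Keep only the first `keep` bytes, simulating a stream that ends early.
    static byte[] truncate(byte[] data, int keep) {
        return Arrays.copyOf(data, keep);
    }

    // Drain an inflating stream; true only if it ended with an EOFException.
    static boolean hitsUnexpectedEnd(byte[] compressed) {
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[256];
            while (in.read(buf) != -1) {
                // discard the inflated bytes
            }
            return false;
        } catch (EOFException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Arbitrary, highly compressible sample data.
    static byte[] samplePayload() {
        byte[] payload = new byte[4096];
        Arrays.fill(payload, (byte) 'x');
        return payload;
    }

    public static void main(String[] args) {
        byte[] full = deflate(samplePayload());
        System.out.println("complete stream fails:  " + hitsUnexpectedEnd(full));
        System.out.println("truncated stream fails: "
                + hitsUnexpectedEnd(truncate(full, full.length / 2)));
    }
}
```

Whether the embedded stream in the attached file is really cut short, or its declared length is misread, cannot be determined from the trace alone.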



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-2166) TaggedIOException from a ZipException on a valid Word file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2166:
-
Attachment: AMSMIC briefing doc.docx

> TaggedIOException from a ZipException on a valid Word file
> --
>
> Key: TIKA-2166
> URL: https://issues.apache.org/jira/browse/TIKA-2166
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: AMSMIC briefing doc.docx
>
>
> On the attached file, which opens with Word, Tika throws:
> org.apache.tika.io.TaggedIOException: invalid block type
>   at 
> org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:63)
>   at org.gagravarr.tika.OggDetector.detect(OggDetector.java:68)
>   at 
> org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
>   at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>   at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
>   at gov.nih.niaid.fscanner.Extract.ExtractContents(Extract.java:63)
>   at gov.nih.niaid.temp.Main.main(Main.java:68)
> Caused by: org.apache.tika.io.TaggedIOException: invalid block type
>   at 
> org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:59)
>   ... 12 more
> Caused by: java.util.zip.ZipException: invalid block type
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at 
> org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   ... 16 more
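
For context on the last "Caused by" above: InflaterInputStream.read turns a data-format error from the inflater into a ZipException. The sketch below triggers one with a hand-built zlib stream whose first deflate block uses the reserved block type 3. It is a JDK-only illustration and assumes nothing about the attached file; the class name and the byte array are hypothetical. Note that ZIP entries normally carry raw deflate data, whereas this demo uses a zlib wrapper for simplicity.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.InflaterInputStream;
import java.util.zip.ZipException;

// Hypothetical demo class, not part of Tika or POI.
final class CorruptDeflateDemo {

    // 0x78 0x9C is a valid zlib header. In the following byte (0x07), bit 0 is
    // the "final block" flag and bits 1-2 are the block type; the value 3
    // (binary 11) is reserved by the deflate format and must be rejected.
    static final byte[] RESERVED_BLOCK_TYPE = {0x78, (byte) 0x9C, 0x07, 0x00, 0x00, 0x00};

    // Drain the stream and describe how it ended.
    static String describeFailure(byte[] data) {
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(data))) {
            while (in.read() != -1) {
                // discard the inflated bytes
            }
            return "ok";
        } catch (ZipException e) {
            return "ZipException: " + e.getMessage();
        } catch (IOException e) {
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(describeFailure(RESERVED_BLOCK_TYPE));
    }
}
```

The message text comes from the underlying zlib library, so the exact wording may vary between JDK builds.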





[jira] [Created] (TIKA-2166) TaggedIOException from a ZipException on a valid Word file

2016-11-04 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2166:


 Summary: TaggedIOException from a ZipException on a valid Word file
 Key: TIKA-2166
 URL: https://issues.apache.org/jira/browse/TIKA-2166
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


On the attached file, which opens with Word, Tika throws:

org.apache.tika.io.TaggedIOException: invalid block type
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:63)
at org.gagravarr.tika.OggDetector.detect(OggDetector.java:68)
at 
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
at 
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
at gov.nih.niaid.fscanner.Extract.ExtractContents(Extract.java:63)
at gov.nih.niaid.temp.Main.main(Main.java:68)
Caused by: org.apache.tika.io.TaggedIOException: invalid block type
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:59)
... 12 more
Caused by: java.util.zip.ZipException: invalid block type
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at 
org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
... 16 more





[jira] [Created] (TIKA-2165) NegativeArraySizeException on a valid Word file

2016-11-04 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2165:


 Summary: NegativeArraySizeException on a valid Word file
 Key: TIKA-2165
 URL: https://issues.apache.org/jira/browse/TIKA-2165
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


On the attached file, which opens with Word, the Tika parser throws an error:

java.lang.NegativeArraySizeException
at org.apache.poi.hwpf.model.Ffn.<init>(Ffn.java:79)
at org.apache.poi.hwpf.model.FontTable.<init>(FontTable.java:66)
at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:344)
at 
org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)





[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2164:
-
Description: 
On the following Powerpoint file:

https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt

which opens fine with Powerpoint, the Tika parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more

EDIT: the attached "Research forum" file emits a similar error "invalid block 
type".
EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar "invalid 
literal/length code" error.
EDIT3: the attached "paperfigures" file emits "invalid distance too far back". 
Something is wrong with ZIP in Powerpoints.
EDIT4: in "Lab meeting", it's "Unexpected end of ZLIB input stream"
"suba" exhibits a similar error, "invalid distance too far back" but in a 
different exception.

  was:
On the following Powerpoint file:

https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt

which opens fine with Powerpoint, the Tika parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more

EDIT: the attached "Research forum" file emits a similar error "invalid block 
type".
EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar "invalid 
literal/length code" error.
EDIT3: the attached "paperfigures" file emits "invalid distance too far back". 
Something is wrong with ZIP in Powerpoints.
EDIT4: in "Lab meeting", it's "Unexpected end of ZLIB input stream"


> HSLFException from ZipException "invalid stored block lengths" on a valid 
> Powerpoint file
> -
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Jankovic final Retreat 2002.PPT, Lab Meeting.ppt, 
> Research Forum 2013.3.ppt, paperfigures.ppt, suba.ppt
>
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> invalid stored block lengths
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>   ... 6 more
> EDIT: the attached "Research forum" file emits a similar error "invalid block 
> type".
> EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar 
> "invalid literal/length code" error.
> 

[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2164:
-
Attachment: suba.ppt

> HSLFException from ZipException "invalid stored block lengths" on a valid 
> Powerpoint file
> -
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Jankovic final Retreat 2002.PPT, Lab Meeting.ppt, 
> Research Forum 2013.3.ppt, paperfigures.ppt, suba.ppt
>
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> invalid stored block lengths
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>   ... 6 more
> EDIT: the attached "Research forum" file emits a similar error "invalid block 
> type".
> EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar 
> "invalid literal/length code" error.
> EDIT3: the attached "paperfigures" file emits "invalid distance too far 
> back". Something is wrong with ZIP in Powerpoints.
> EDIT4: in "Lab meeting", it's "Unexpected end of ZLIB input stream"
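
Since the attached files fail with different messages under different wrapper exceptions, it can help to sort failures by their innermost cause. The sketch below walks a cause chain with the JDK only. The class name and category labels are hypothetical, and the split into "truncated" versus "corrupt" is an assumption for triage, not a diagnosis of these files.

```java
import java.io.EOFException;
import java.io.IOException;
import java.util.zip.ZipException;

// Hypothetical triage helper, not part of Tika or POI.
final class RootCauseTriage {

    // Follow getCause() until the innermost throwable is reached.
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    // Map the innermost cause to a coarse, made-up category label.
    static String classify(Throwable t) {
        Throwable root = rootCause(t);
        if (root instanceof EOFException) {
            return "truncated-stream";
        }
        if (root instanceof ZipException) {
            return "corrupt-deflate: " + root.getMessage();
        }
        return "other: " + root.getClass().getName();
    }

    public static void main(String[] args) {
        // Stand-in for a wrapper exception around a ZipException.
        Throwable wrapped = new RuntimeException(
                new IOException("wrapper", new ZipException("invalid stored block lengths")));
        System.out.println(classify(wrapped));
    }
}
```

Grouping the reports this way would separate the "Unexpected end of ZLIB input stream" case from the ZipException variants listed in the edits above.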





[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2164:
-
Attachment: Lab Meeting.ppt

> HSLFException from ZipException "invalid stored block lengths" on a valid 
> Powerpoint file
> -
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Jankovic final Retreat 2002.PPT, Lab Meeting.ppt, 
> Research Forum 2013.3.ppt, paperfigures.ppt
>
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> invalid stored block lengths
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>   ... 6 more
> EDIT: the attached "Research forum" file emits a similar error "invalid block 
> type".
> EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar 
> "invalid literal/length code" error.
> EDIT3: the attached "paperfigures" file emits "invalid distance too far 
> back". Something is wrong with ZIP in Powerpoints.





[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2164:
-
Description: 
On the following Powerpoint file:

https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt

which opens fine with Powerpoint, the Tika parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more

EDIT: the attached "Research forum" file emits a similar error "invalid block 
type".
EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar "invalid 
literal/length code" error.
EDIT3: the attached "paperfigures" file emits "invalid distance too far back". 
Something is wrong with ZIP in Powerpoints.
EDIT4: in "Lab meeting", it's "Unexpected end of ZLIB input stream"

  was:
On the following Powerpoint file:

https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt

which opens fine with Powerpoint, the Tika parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more

EDIT: the attached "Research forum" file emits a similar error "invalid block 
type".
EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar "invalid 
literal/length code" error.
EDIT3: the attached "paperfigures" file emits "invalid distance too far back". 
Something is wrong with ZIP in Powerpoints.


> HSLFException from ZipException "invalid stored block lengths" on a valid 
> Powerpoint file
> -
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Jankovic final Retreat 2002.PPT, Lab Meeting.ppt, 
> Research Forum 2013.3.ppt, paperfigures.ppt
>
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> invalid stored block lengths
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>   ... 6 more
> EDIT: the attached "Research forum" file emits a similar error "invalid block 
> type".
> EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar 
> "invalid literal/length code" error.
> EDIT3: the attached "paperfigures" file emits "invalid distance too far 
> back". Something is wrong with ZIP in Powerpoints.
> EDIT4: in "Lab meeting", it's "Unexpected end 

[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2164:
-
Attachment: paperfigures.ppt

> HSLFException from ZipException "invalid stored block lengths" on a valid 
> Powerpoint file
> -
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Jankovic final Retreat 2002.PPT, Research Forum 
> 2013.3.ppt, paperfigures.ppt
>
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> invalid stored block lengths
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>   ... 6 more
> EDIT: the attached "Research forum" file emits a similar error "invalid block 
> type".
> EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar 
> "invalid literal/length code" error.





[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2164:
-
Description: 
On the following Powerpoint file:

https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt

which opens fine with Powerpoint, the Tika parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more

EDIT: the attached "Research forum" file emits a similar error "invalid block 
type".
EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar "invalid 
literal/length code" error.
EDIT3: the attached "paperfigures" file emits "invalid distance too far back". 
Something is wrong with ZIP in Powerpoints.

  was:
On the following Powerpoint file:

https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt

which opens fine with Powerpoint, the Tika parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more

EDIT: the attached "Research forum" file emits a similar error "invalid block 
type".
EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar "invalid 
literal/length code" error.


> HSLFException from ZipException "invalid stored block lengths" on a valid 
> Powerpoint file
> -
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Jankovic final Retreat 2002.PPT, Research Forum 
> 2013.3.ppt, paperfigures.ppt
>
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> invalid stored block lengths
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>   ... 6 more
> EDIT: the attached "Research forum" file emits a similar error "invalid block 
> type".
> EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar 
> "invalid literal/length code" error.
> EDIT3: the attached "paperfigures" file emits "invalid distance too far 
> back". Something is wrong with ZIP in Powerpoints.
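For context on the messages listed above: java.util.zip.InflaterInputStream reports malformed zlib data as a ZipException, and the message text comes from the underlying inflater. The following is only a minimal sketch of that behaviour on hand-made bytes; it does not reproduce the attached files, and the class name InflateProbe and the helper probe are hypothetical names introduced for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.InflaterInputStream;

class InflateProbe {
    // Hypothetical helper: returns null when the bytes inflate cleanly,
    // otherwise "<ExceptionClass>: <message>" for the failure that occurred.
    static String probe(byte[] data) {
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[4096];
            while (in.read(buf) != -1) {
                // Output is discarded; only success or failure matters here.
            }
            return null;
        } catch (IOException e) {
            // ZipException is a subclass of IOException.
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // 0x77 0x09 satisfies the zlib header checksum but declares
        // compression method 7 instead of 8 (deflate).
        byte[] notDeflate = {0x77, 0x09, 0x00, 0x00};
        System.out.println(probe(notDeflate));
    }
}
```

Running the same probe over the raw picture bytes of a failing file would show whether the data is really a zlib stream at the offset the parser assumes, which is one way to narrow down reports like these.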



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2164:
-
Description: 
On the following Powerpoint file:

https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt

which opens fine with Powerpoint, the Tika parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more

EDIT: the attached "Research forum" file emits a similar error "invalid block 
type".
EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar "invalid 
literal/length code" error.

  was:
On the following Powerpoint file:

https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt

which opens fine with Powerpoint, the Tika parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more

EDIT: the attached file emits a similar error "invalid block type".


> HSLFException from ZipException "invalid stored block lengths" on a valid 
> Powerpoint file
> -
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Jankovic final Retreat 2002.PPT, Research Forum 
> 2013.3.ppt
>
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> invalid stored block lengths
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>   ... 6 more
> EDIT: the attached "Research forum" file emits a similar error "invalid block 
> type".
> EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar 
> "invalid literal/length code" error.





[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2164:
-
Attachment: Jankovic final Retreat 2002.PPT

> HSLFException from ZipException "invalid stored block lengths" on a valid 
> Powerpoint file
> -
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Jankovic final Retreat 2002.PPT, Research Forum 
> 2013.3.ppt
>
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> invalid stored block lengths
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>   ... 6 more
> EDIT: the attached file emits a similar error "invalid block type".





[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2164:
-
Description: 
On the following Powerpoint file:

https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt

which opens fine with Powerpoint, the Tika parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more

EDIT: the attached file emits a similar error "invalid block type".

  was:
On the following Powerpoint file:

https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt

which opens fine with Powerpoint, the Tika parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more

EDIT: the other file emits a similar error "invalid block type".


> HSLFException from ZipException "invalid stored block lengths" on a valid 
> Powerpoint file
> -
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Research Forum 2013.3.ppt
>
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> invalid stored block lengths
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>   ... 6 more
> EDIT: the attached file emits a similar error "invalid block type".





[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2164:
-
Description: 
On the following Powerpoint file:

https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt

which opens fine with Powerpoint, the Tika parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more

EDIT: the other file emits a similar error "invalid block type".

  was:
On the following Powerpoint file:

https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt

which opens fine with Powerpoint, the Tika parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more


> HSLFException from ZipException "invalid stored block lengths" on a valid 
> Powerpoint file
> -
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Research Forum 2013.3.ppt
>
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> invalid stored block lengths
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>   ... 6 more
> EDIT: the other file emits a similar error "invalid block type".





[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2164:
-
Attachment: Research Forum 2013.3.ppt

> HSLFException from ZipException "invalid stored block lengths" on a valid 
> Powerpoint file
> -
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Research Forum 2013.3.ppt
>
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> invalid stored block lengths
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>   ... 6 more





[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2164:
-
Description: 
On the following Powerpoint file:

https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt

which opens fine with Powerpoint, the Tika parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more

  was:
On the attached Powerpoint file, which opens fine with Powerpoint, the Tika 
parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more


> HSLFException from ZipException "invalid stored block lengths" on a valid 
> Powerpoint file
> -
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> invalid stored block lengths
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>   ... 6 more





[jira] [Created] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2164:


         Summary: HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file
             Key: TIKA-2164
             URL: https://issues.apache.org/jira/browse/TIKA-2164
         Project: Tika
      Issue Type: Bug
      Components: parser
Affects Versions: 1.13
     Environment: Windows 7 x64, JVM 1.8.0_101
        Reporter: Seva Alekseyev


On the attached Powerpoint file, which opens fine with Powerpoint, the Tika 
parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more





[jira] [Updated] (TIKA-2163) POIXMLException from ClassCastException on a valid Word template

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2163:
-
Attachment: ChronologicalResume.dotx

> POIXMLException from ClassCastException on a valid Word template
> 
>
> Key: TIKA-2163
> URL: https://issues.apache.org/jira/browse/TIKA-2163
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: ChronologicalResume.dotx
>
>
> On the attached Word template, which opens fine with Word, the Tika parser 
> throws the following error:
> org.apache.poi.POIXMLException: java.lang.reflect.InvocationTargetException
>   at 
> org.apache.poi.POIXMLFactory.createDocumentPart(POIXMLFactory.java:65)
>   at org.apache.poi.POIXMLDocumentPart.read(POIXMLDocumentPart.java:601)
>   at org.apache.poi.POIXMLDocumentPart.read(POIXMLDocumentPart.java:613)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:156)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> Caused by: java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFactory.createDocumentPart(XWPFFactory.java:57)
>   at 
> org.apache.poi.POIXMLFactory.createDocumentPart(POIXMLFactory.java:60)
>   ... 10 more
> Caused by: java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart 
> cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument
>   at 
> org.apache.poi.xwpf.usermodel.XWPFHeaderFooter.<init>(XWPFHeaderFooter.java:74)
>   at org.apache.poi.xwpf.usermodel.XWPFHeader.<init>(XWPFHeader.java:54)
>   ... 16 more





[jira] [Created] (TIKA-2163) POIXMLException from ClassCastException on a valid Word template

2016-11-04 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2163:


         Summary: POIXMLException from ClassCastException on a valid Word template
             Key: TIKA-2163
             URL: https://issues.apache.org/jira/browse/TIKA-2163
         Project: Tika
      Issue Type: Bug
      Components: parser
Affects Versions: 1.13
     Environment: Windows 7 x64, JVM 1.8.0_101
        Reporter: Seva Alekseyev
     Attachments: ChronologicalResume.dotx

On the attached Word template, which opens fine with Word, the Tika parser 
throws the following error:

org.apache.poi.POIXMLException: java.lang.reflect.InvocationTargetException
at 
org.apache.poi.POIXMLFactory.createDocumentPart(POIXMLFactory.java:65)
at org.apache.poi.POIXMLDocumentPart.read(POIXMLDocumentPart.java:601)
at org.apache.poi.POIXMLDocumentPart.read(POIXMLDocumentPart.java:613)
at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:156)
at 
org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
at 
org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
at 
org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at 
org.apache.poi.xwpf.usermodel.XWPFFactory.createDocumentPart(XWPFFactory.java:57)
at 
org.apache.poi.POIXMLFactory.createDocumentPart(POIXMLFactory.java:60)
... 10 more
Caused by: java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart 
cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument
at 
org.apache.poi.xwpf.usermodel.XWPFHeaderFooter.<init>(XWPFHeaderFooter.java:74)
at org.apache.poi.xwpf.usermodel.XWPFHeader.<init>(XWPFHeader.java:54)
... 16 more
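
The three-level trace above (wrapper exception, InvocationTargetException, ClassCastException) is the usual shape when a constructor invoked through reflection throws. The sketch below shows that shape and how to walk down to the root cause. It is a hedged illustration only: CauseChainDemo, Part, construct and rootCause are hypothetical names, and the bad cast is a stand-in, not POI's actual code.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.InvocationTargetException;

class CauseChainDemo {
    // Hypothetical stand-in for a document part whose constructor
    // assumes its parent has a particular type.
    static class Part {
        public Part(Object parent) {
            // Throws ClassCastException when parent is not a String.
            String expected = (String) parent;
        }
    }

    // Creates a Part reflectively. Returns null on success, or a wrapper
    // exception whose cause chain mirrors the trace in the report.
    static Throwable construct(Object parent) {
        try {
            Constructor<Part> ctor = Part.class.getDeclaredConstructor(Object.class);
            ctor.newInstance(parent);
            return null;
        } catch (InvocationTargetException e) {
            // Reflection wraps whatever the constructor threw.
            return new RuntimeException(e);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Follows getCause() links to the innermost exception.
    static Throwable rootCause(Throwable t) {
        Throwable cur = t;
        while (cur.getCause() != null && cur.getCause() != cur) {
            cur = cur.getCause();
        }
        return cur;
    }

    public static void main(String[] args) {
        Throwable failure = construct(Integer.valueOf(1));
        System.out.println(rootCause(failure).getClass().getName());
    }
}
```

When triaging reports like this one, the innermost "Caused by" is the informative part; the outer layers only record that the constructor was called reflectively.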





[jira] [Created] (TIKA-2162) "Unknown compression method" on a Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2162:


         Summary: "Unknown compression method" on a Powerpoint file
             Key: TIKA-2162
             URL: https://issues.apache.org/jira/browse/TIKA-2162
         Project: Tika
      Issue Type: Bug
      Components: parser
Affects Versions: 1.13
     Environment: Windows 7 x64, JVM 1.8.0_101
        Reporter: Seva Alekseyev
     Attachments: DECAY.ppt

On the attached Powerpoint file, which opens fine with Powerpoint, the Tika 
parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
unknown compression method
at org.apache.poi.hslf.blip.EMF.getData(EMF.java:91)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: unknown compression method
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.EMF.getData(EMF.java:85)
... 6 more





[jira] [Updated] (TIKA-2162) "Unknown compression method" on a Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2162:
-
Attachment: DECAY.ppt

> "Unknown compression method" on a Powerpoint file
> -
>
> Key: TIKA-2162
> URL: https://issues.apache.org/jira/browse/TIKA-2162
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: DECAY.ppt
>
>
> On the attached Powerpoint file, which opens fine with Powerpoint, the Tika 
> parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> unknown compression method
>   at org.apache.poi.hslf.blip.EMF.getData(EMF.java:91)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: unknown compression method
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.EMF.getData(EMF.java:85)
>   ... 6 more





[jira] [Updated] (TIKA-2161) TaggedIOException from EOFException on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2161:
-
Attachment: Erik-LymeChipBranchSeminar.ppt

> TaggedIOException from EOFException on a valid Powerpoint file
> --
>
> Key: TIKA-2161
> URL: https://issues.apache.org/jira/browse/TIKA-2161
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Erik-LymeChipBranchSeminar.ppt
>
>
> On the attached Powerpoint file, which opens fine with Powerpoint, the Tika 
> parser throws the following error:
> org.apache.tika.io.TaggedIOException: Unexpected end of ZLIB input stream
>   at 
> org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at java.nio.file.Files.copy(Files.java:2908)
>   at java.nio.file.Files.copy(Files.java:3027)
>   at org.apache.tika.io.TikaInputStream.getPath(TikaInputStream.java:587)
>   at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:615)
>   at 
> org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:377)
>   at 
> org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:443)
>   at 
> org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
>   at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>   at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
>   at 
> org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:140)
>   at 
> org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:116)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources(HSLFExtractor.java:368)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:138)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.io.EOFException: Unexpected end of ZLIB input stream
>   at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
>   at 
> org.apache.poi.util.BoundedInputStream.read(BoundedInputStream.java:121)
>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   ... 22 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-2161) TaggedIOException from EOFException on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2161:


 Summary: TaggedIOException from EOFException on a valid Powerpoint file
 Key: TIKA-2161
 URL: https://issues.apache.org/jira/browse/TIKA-2161
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


On the attached Powerpoint file, which opens fine with Powerpoint, the Tika 
parser throws the following error:

org.apache.tika.io.TaggedIOException: Unexpected end of ZLIB input stream
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at java.nio.file.Files.copy(Files.java:2908)
at java.nio.file.Files.copy(Files.java:3027)
at org.apache.tika.io.TikaInputStream.getPath(TikaInputStream.java:587)
at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:615)
at 
org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:377)
at 
org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:443)
at 
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
at 
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
at 
org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:140)
at 
org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:116)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources(HSLFExtractor.java:368)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:138)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.io.EOFException: Unexpected end of ZLIB input stream
at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
at 
org.apache.poi.util.BoundedInputStream.read(BoundedInputStream.java:121)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
... 22 more
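
For context on the root cause in the trace above: `InflaterInputStream` raises `EOFException: Unexpected end of ZLIB input stream` when the compressed bytes run out before the zlib stream is complete. The sketch below reproduces that message with the JDK alone by truncating a valid stream. The class and method names are made up for illustration; nothing here is Tika or POI code, and it does not show why the embedded object in the attached file ends early.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Illustrative only: hypothetical names, not part of Tika or POI.
final class TruncatedZlibDemo {

    /** Compresses the input into a complete zlib stream. */
    static byte[] deflate(byte[] raw) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(sink)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /**
     * Inflates only the first keepBytes bytes of a zlib stream.
     * Returns "ok" on success, or the EOFException message if the data ran out early.
     */
    static String inflatePrefix(byte[] compressed, int keepBytes) {
        byte[] prefix = Arrays.copyOf(compressed, keepBytes);
        byte[] buffer = new byte[256];
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(prefix))) {
            while (in.read(buffer) != -1) {
                // Discard the inflated bytes; only the outcome matters here.
            }
            return "ok";
        } catch (EOFException e) {
            return e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] raw = new byte[4096];
        for (int i = 0; i < raw.length; i++) {
            raw[i] = (byte) (i % 97);
        }
        byte[] compressed = deflate(raw);
        System.out.println(inflatePrefix(compressed, compressed.length));     // complete stream
        System.out.println(inflatePrefix(compressed, compressed.length / 2)); // truncated stream
    }
}
```

The complete stream inflates normally, while the truncated one fails inside `InflaterInputStream.fill`, the same frame that appears in the `Caused by` section of the report.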





[jira] [Updated] (TIKA-2160) POIXMLException from NullPointerException on a valid Word file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2160:
-
Attachment: test_16022016081053.docx

> POIXMLException from NullPointerException on a valid Word file
> --
>
> Key: TIKA-2160
> URL: https://issues.apache.org/jira/browse/TIKA-2160
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: test_16022016081053.docx
>
>
> On the attached Word file, which opens fine with Word (albeit with no text), 
> the Tika parser throws the following error:
> org.apache.poi.POIXMLException: java.lang.NullPointerException
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFooter.onDocumentRead(XWPFFooter.java:130)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:208)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.poi.xwpf.usermodel.AbstractXWPFSDT.<init>(AbstractXWPFSDT.java:37)
>   at org.apache.poi.xwpf.usermodel.XWPFSDT.<init>(XWPFSDT.java:38)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFooter.onDocumentRead(XWPFFooter.java:124)
>   ... 9 more





[jira] [Created] (TIKA-2160) POIXMLException from NullPointerException on a valid Word file

2016-11-04 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2160:


 Summary: POIXMLException from NullPointerException on a valid Word file
 Key: TIKA-2160
 URL: https://issues.apache.org/jira/browse/TIKA-2160
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


On the attached Word file, which opens fine with Word (albeit with no text), 
the Tika parser throws the following error:

org.apache.poi.POIXMLException: java.lang.NullPointerException
at 
org.apache.poi.xwpf.usermodel.XWPFFooter.onDocumentRead(XWPFFooter.java:130)
at 
org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:208)
at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
at 
org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
at 
org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
at 
org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: java.lang.NullPointerException
at 
org.apache.poi.xwpf.usermodel.AbstractXWPFSDT.<init>(AbstractXWPFSDT.java:37)
at org.apache.poi.xwpf.usermodel.XWPFSDT.<init>(XWPFSDT.java:38)
at 
org.apache.poi.xwpf.usermodel.XWPFFooter.onDocumentRead(XWPFFooter.java:124)
... 9 more






[jira] [Updated] (TIKA-2158) NullPointerException on a valid Word file

2016-11-03 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2158:
-
Attachment: RTOP_Template01112015063856.docx

> NullPointerException on a valid Word file
> -
>
> Key: TIKA-2158
> URL: https://issues.apache.org/jira/browse/TIKA-2158
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: RTOP_Template01112015063856.docx
>
>
> On the attached Word file, which opens fine with Word, the Tika parser throws 
> the following error:
> java.lang.NullPointerException
>   at 
> org.apache.poi.xwpf.usermodel.XWPFSDTContentCell.<init>(XWPFSDTContentCell.java:49)
>   at org.apache.poi.xwpf.usermodel.XWPFSDTCell.<init>(XWPFSDTCell.java:35)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFTableRow.getTableICells(XWPFTableRow.java:147)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractTable(XWPFWordExtractorDecorator.java:359)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:111)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:93)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Created] (TIKA-2158) NullPointerException on a valid Word file

2016-11-03 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2158:


 Summary: NullPointerException on a valid Word file
 Key: TIKA-2158
 URL: https://issues.apache.org/jira/browse/TIKA-2158
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
 Attachments: RTOP_Template01112015063856.docx

On the attached Word file, which opens fine with Word, the Tika parser throws 
the following error:

java.lang.NullPointerException
at 
org.apache.poi.xwpf.usermodel.XWPFSDTContentCell.<init>(XWPFSDTContentCell.java:49)
at org.apache.poi.xwpf.usermodel.XWPFSDTCell.<init>(XWPFSDTCell.java:35)
at 
org.apache.poi.xwpf.usermodel.XWPFTableRow.getTableICells(XWPFTableRow.java:147)
at 
org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractTable(XWPFWordExtractorDecorator.java:359)
at 
org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:111)
at 
org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:93)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Updated] (TIKA-2157) HSLFException on a valid Powerpoint file

2016-11-03 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2157:
-
Attachment: CRADA 2-09 K Subbarao.ppt

> HSLFException on a valid Powerpoint file
> 
>
> Key: TIKA-2157
> URL: https://issues.apache.org/jira/browse/TIKA-2157
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: CRADA 2-09 K Subbarao.ppt
>
>
> On the attached Powerpoint file, which opens fine with Powerpoint, the Tika 
> parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> incorrect data check
>   at org.apache.poi.hslf.blip.PICT.getData(PICT.java:120)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: incorrect data check
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.PICT.read(PICT.java:133)
>   at org.apache.poi.hslf.blip.PICT.getData(PICT.java:116)
>   ... 6 more
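
A note on the root cause shown above: `ZipException: incorrect data check` is what `InflaterInputStream` reports when the inflated bytes do not match the Adler-32 checksum stored at the end of a zlib stream. The sketch below triggers the same message with the JDK alone by flipping one bit in that trailer. All names are hypothetical; this is not POI's `PICT` code, and it does not establish what is wrong with the picture data in the attached file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;
import java.util.zip.ZipException;

// Illustrative only: hypothetical names, not part of Tika or POI.
final class CorruptChecksumDemo {

    /** Compresses the input into a complete zlib stream. */
    static byte[] deflate(byte[] raw) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(sink)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Returns a copy with one bit flipped in the last byte, which belongs to the Adler-32 trailer. */
    static byte[] corruptTrailer(byte[] compressed) {
        byte[] copy = compressed.clone();
        copy[copy.length - 1] ^= 0x01;
        return copy;
    }

    /** Returns "ok" if the stream inflates cleanly, otherwise the ZipException message. */
    static String inflate(byte[] compressed) {
        byte[] buffer = new byte[256];
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            while (in.read(buffer) != -1) {
                // Discard the inflated bytes; only the outcome matters here.
            }
            return "ok";
        } catch (ZipException e) {
            return e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] compressed = deflate("picture bytes would go here".getBytes());
        System.out.println(inflate(compressed));
        System.out.println(inflate(corruptTrailer(compressed)));
    }
}
```

The compressed payload itself is untouched in the second call; only the checksum differs, which is enough for the read to fail.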





[jira] [Updated] (TIKA-2155) IndexOutOfBoundsException on a valid Excel file

2016-11-03 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2155:
-
Attachment: Copy of [corrupted Unicode text].xlsx

> IndexOutOfBoundsException on a valid Excel file
> ---
>
> Key: TIKA-2155
> URL: https://issues.apache.org/jira/browse/TIKA-2155
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Copy of [corrupted Unicode text].xlsx
>
>
> On the attached Excel file, which opens fine with Excel, the Tika parser 
> throws the following error:
> java.lang.IndexOutOfBoundsException: Index: 65535, Size: 251
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at 
> org.apache.poi.xssf.model.StylesTable.getStyleAt(StylesTable.java:421)
>   at 
> org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.startElement(XSSFSheetXMLHandler.java:281)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.startElement(XSSFExcelExtractorDecorator.java:345)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(AbstractSAXParser.java:509)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractXMLDocumentParser.emptyElement(AbstractXMLDocumentParser.java:182)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:356)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2786)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
>   at 
> com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processSheet(XSSFExcelExtractorDecorator.java:195)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:138)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:97)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Created] (TIKA-2155) IndexOutOfBoundsException on a valid Excel file

2016-11-03 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2155:


 Summary: IndexOutOfBoundsException on a valid Excel file
 Key: TIKA-2155
 URL: https://issues.apache.org/jira/browse/TIKA-2155
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
 Attachments: Copy of [corrupted Unicode text].xlsx

On the attached Excel file, which opens fine with Excel, the Tika parser throws 
the following error:

java.lang.IndexOutOfBoundsException: Index: 65535, Size: 251
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at 
org.apache.poi.xssf.model.StylesTable.getStyleAt(StylesTable.java:421)
at 
org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.startElement(XSSFSheetXMLHandler.java:281)
at 
org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.startElement(XSSFExcelExtractorDecorator.java:345)
at 
com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(AbstractSAXParser.java:509)
at 
com.sun.org.apache.xerces.internal.parsers.AbstractXMLDocumentParser.emptyElement(AbstractXMLDocumentParser.java:182)
at 
com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:356)
at 
com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2786)
at 
com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
at 
com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
at 
com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
at 
com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
at 
com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
at 
com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at 
com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
at 
com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
at 
org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processSheet(XSSFExcelExtractorDecorator.java:195)
at 
org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:138)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
at 
org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:97)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
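
The top frames show a plain `ArrayList.get` being asked for index 65535 in a list of 251 entries. 65535 is 0xFFFF, the largest unsigned 16-bit value; whether the file uses it as an "unset" marker is a guess that the report does not confirm. The sketch below only illustrates the failure mode and one possible guard. `getIfPresent` is a hypothetical helper, not a POI or Tika method, and not a proposed patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Illustrative only: hypothetical names, not part of Tika or POI.
final class StyleLookupSketch {

    /** Returns the element at index, or empty when the index falls outside the list. */
    static <T> Optional<T> getIfPresent(List<T> items, int index) {
        if (index < 0 || index >= items.size()) {
            return Optional.empty();
        }
        return Optional.ofNullable(items.get(index));
    }

    public static void main(String[] args) {
        // Stand-in for a styles table with 251 entries, as in the report.
        List<String> styles = new ArrayList<>();
        for (int i = 0; i < 251; i++) {
            styles.add("style-" + i);
        }

        try {
            styles.get(65535); // unguarded lookup, as in the stack trace
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unguarded lookup failed: " + e.getMessage());
        }

        // Guarded lookups: out-of-range yields empty instead of throwing.
        System.out.println(getIfPresent(styles, 65535).isPresent());
        System.out.println(getIfPresent(styles, 250).orElse("none"));
    }
}
```

The exact wording of the exception message depends on the JVM version, so the sketch prints it rather than matching on it.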






[jira] [Updated] (TIKA-2154) RecordFormatException on a valid Excel file

2016-11-03 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2154:
-
Attachment: Interface_Availability.xls

> RecordFormatException on a valid Excel file
> ---
>
> Key: TIKA-2154
> URL: https://issues.apache.org/jira/browse/TIKA-2154
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Interface_Availability.xls
>
>
> On the attached XLS file, which opens fine with Excel, the Tika parser throws 
> the following error:
> org.apache.poi.hssf.record.RecordFormatException: Unable to construct record 
> instance
>   at 
> org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:98)
>   at 
> org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:334)
>   at 
> org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:308)
>   at 
> org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:274)
>   at 
> org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:155)
>   at 
> org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:118)
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:309)
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:169)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:154)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.lang.IllegalStateException: Should never be called before end 
> of current record
>   at 
> org.apache.poi.hssf.record.RecordInputStream.isContinueNext(RecordInputStream.java:455)
>   at 
> org.apache.poi.hssf.record.RecordInputStream.readStringCommon(RecordInputStream.java:386)
>   at 
> org.apache.poi.hssf.record.RecordInputStream.readUnicodeLEString(RecordInputStream.java:342)
>   at org.apache.poi.hssf.record.FormatRecord.<init>(FormatRecord.java:57)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at 
> org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:84)
>   ... 11 more





[jira] [Created] (TIKA-2154) RecordFormatException on a valid Excel file

2016-11-03 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2154:


 Summary: RecordFormatException on a valid Excel file
 Key: TIKA-2154
 URL: https://issues.apache.org/jira/browse/TIKA-2154
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


On the attached XLS file, which opens fine with Excel, the Tika parser throws 
the following error:

org.apache.poi.hssf.record.RecordFormatException: Unable to construct record 
instance
at 
org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:98)
at 
org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:334)
at 
org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:308)
at 
org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:274)
at 
org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:155)
at 
org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:118)
at 
org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:309)
at 
org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:169)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:154)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.lang.IllegalStateException: Should never be called before end 
of current record
at 
org.apache.poi.hssf.record.RecordInputStream.isContinueNext(RecordInputStream.java:455)
at 
org.apache.poi.hssf.record.RecordInputStream.readStringCommon(RecordInputStream.java:386)
at 
org.apache.poi.hssf.record.RecordInputStream.readUnicodeLEString(RecordInputStream.java:342)
at org.apache.poi.hssf.record.FormatRecord.<init>(FormatRecord.java:57)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at 
org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:84)
... 11 more





[jira] [Created] (TIKA-2153) TaggedIOException on a valid Powerpoint file

2016-11-01 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2153:


 Summary: TaggedIOException on a valid Powerpoint file
 Key: TIKA-2153
 URL: https://issues.apache.org/jira/browse/TIKA-2153
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


On the following Powerpoint file, which opens fine with Powerpoint:

https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx

the Tika parser throws the following error:

org.apache.tika.io.TaggedIOException: invalid stored block lengths
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
at 
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
at 
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78)
... 13 more
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at 
org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
... 19 more




Could be similar to #2130.
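
On the innermost cause in the trace: "invalid stored block lengths" is the message zlib gives when an uncompressed ("stored") DEFLATE block carries a LEN field and an NLEN field that are not one's complements of each other, as RFC 1951 requires. The sketch below builds such a block by hand and feeds it to the JDK inflater in raw mode, which is how ZIP entries are inflated. All names are hypothetical, and this does not show how the bytes in the attached file came to be read that way.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;
import java.util.zip.ZipException;

// Illustrative only: hypothetical names, not part of Tika or POI.
final class StoredBlockDemo {

    /**
     * Builds a raw DEFLATE stream holding a single final stored block.
     * Payload must be at most 65535 bytes. With corruptNlen set, NLEN is
     * written equal to LEN instead of its one's complement.
     */
    static byte[] storedBlock(byte[] payload, boolean corruptNlen) {
        int len = payload.length;
        int nlen = ~len & 0xFFFF;
        if (corruptNlen) {
            nlen ^= 0xFFFF;
        }
        byte[] out = new byte[5 + payload.length];
        out[0] = 0x01;                 // BFINAL=1, BTYPE=00 (stored)
        out[1] = (byte) len;           // LEN, little-endian
        out[2] = (byte) (len >>> 8);
        out[3] = (byte) nlen;          // NLEN, little-endian
        out[4] = (byte) (nlen >>> 8);
        System.arraycopy(payload, 0, out, 5, payload.length);
        return out;
    }

    /** Returns "ok" if the raw stream inflates cleanly, otherwise the ZipException message. */
    static String inflateRaw(byte[] data) {
        Inflater inflater = new Inflater(true); // true = raw DEFLATE, no zlib header
        byte[] buffer = new byte[256];
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(data), inflater)) {
            while (in.read(buffer) != -1) {
                // Discard the inflated bytes; only the outcome matters here.
            }
            return "ok";
        } catch (ZipException e) {
            return e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            inflater.end(); // the stream does not release a caller-supplied Inflater
        }
    }

    public static void main(String[] args) {
        byte[] payload = {1, 2, 3, 4, 5};
        System.out.println(inflateRaw(storedBlock(payload, false)));
        System.out.println(inflateRaw(storedBlock(payload, true)));
    }
}
```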





[jira] [Created] (TIKA-2152) NullPointerException on a valid Word file

2016-11-01 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2152:


 Summary: NullPointerException on a valid Word file
 Key: TIKA-2152
 URL: https://issues.apache.org/jira/browse/TIKA-2152
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
 Attachments: A5346.docx

On the attached Word document, which opens fine in Word, the Tika parser throws 
the following error:

java.lang.NullPointerException
at 
org.apache.poi.xwpf.usermodel.XWPFStyles.getStyle(XWPFStyles.java:198)
at 
org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:149)
at 
org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:107)
at 
org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractTable(XWPFWordExtractorDecorator.java:362)
at 
org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractHeaderText(XWPFWordExtractorDecorator.java:414)
at 
org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractHeaders(XWPFWordExtractorDecorator.java:404)
at 
org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:89)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Updated] (TIKA-2152) NullPointerException on a valid Word file

2016-11-01 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2152:
-
Attachment: A5346.docx

> NullPointerException on a valid Word file
> -
>
> Key: TIKA-2152
> URL: https://issues.apache.org/jira/browse/TIKA-2152
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: A5346.docx
>
>
> On the attached Word document, which opens fine in Word, the Tika parser 
> throws the following error:
> java.lang.NullPointerException
>   at 
> org.apache.poi.xwpf.usermodel.XWPFStyles.getStyle(XWPFStyles.java:198)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:149)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:107)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractTable(XWPFWordExtractorDecorator.java:362)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractHeaderText(XWPFWordExtractorDecorator.java:414)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractHeaders(XWPFWordExtractorDecorator.java:404)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:89)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Updated] (TIKA-2144) NullPointerException on a valid Word file

2016-10-28 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2144:
-
Attachment: (was: Proposal ID 17 Offeror ChromoLogic.docx)

> NullPointerException on a valid Word file
> -
>
> Key: TIKA-2144
> URL: https://issues.apache.org/jira/browse/TIKA-2144
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
>
> On the attached Word file, which opens fine in Word, the Tika parser throws 
> the following error:
> java.lang.NullPointerException
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:149)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:107)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:93)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Created] (TIKA-2147) ClassCastException on a valid Word template

2016-10-27 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2147:


 Summary: ClassCastException on a valid Word template
 Key: TIKA-2147
 URL: https://issues.apache.org/jira/browse/TIKA-2147
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
 Attachments: Forefront Fax.dotx

On the attached document template, which opens fine in Word, the Tika parser 
throws the following error:

java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart cannot be cast 
to org.apache.poi.xwpf.usermodel.XWPFDocument
at 
org.apache.poi.xwpf.usermodel.XWPFFootnotes.getXWPFDocument(XWPFFootnotes.java:162)
at 
org.apache.poi.xwpf.usermodel.XWPFFootnote.<init>(XWPFFootnote.java:47)
at 
org.apache.poi.xwpf.usermodel.XWPFFootnotes.onDocumentRead(XWPFFootnotes.java:95)
at 
org.apache.poi.POIXMLDocumentPart._invokeOnDocumentRead(POIXMLDocumentPart.java:658)
at 
org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:235)
at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
at 
org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
at 
org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
at 
org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Updated] (TIKA-2147) ClassCastException on a valid Word template

2016-10-27 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2147:
-
Attachment: Forefront Fax.dotx

> ClassCastException on a valid Word template
> ---
>
> Key: TIKA-2147
> URL: https://issues.apache.org/jira/browse/TIKA-2147
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Forefront Fax.dotx
>
>
> On the attached document template, which opens fine in Word, the Tika parser 
> throws the following error:
> java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart cannot be 
> cast to org.apache.poi.xwpf.usermodel.XWPFDocument
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFootnotes.getXWPFDocument(XWPFFootnotes.java:162)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFootnote.<init>(XWPFFootnote.java:47)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFootnotes.onDocumentRead(XWPFFootnotes.java:95)
>   at 
> org.apache.poi.POIXMLDocumentPart._invokeOnDocumentRead(POIXMLDocumentPart.java:658)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:235)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Commented] (TIKA-2144) NullPointerException on a valid Word file

2016-10-27 Thread Seva Alekseyev (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15611753#comment-15611753
 ] 

Seva Alekseyev commented on TIKA-2144:
--

No idea. I was given a huge library of documents (Office and PDF) and told to 
implement full text search. I might or might not be able to track down the 
author, but that's irrelevant. If it opens in Word, it's a valid document, ergo 
it should be in my index.

> NullPointerException on a valid Word file
> -
>
> Key: TIKA-2144
> URL: https://issues.apache.org/jira/browse/TIKA-2144
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Proposal ID 17 Offeror ChromoLogic.docx
>
>
> On the attached Word file, which opens fine in Word, the Tika parser throws 
> the following error:
> java.lang.NullPointerException
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:149)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:107)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:93)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Created] (TIKA-2144) NullPointerException on a valid Word file

2016-10-26 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2144:


 Summary: NullPointerException on a valid Word file
 Key: TIKA-2144
 URL: https://issues.apache.org/jira/browse/TIKA-2144
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


On the attached Word file, which opens fine in Word, the Tika parser throws the 
following error:

java.lang.NullPointerException
at 
org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:149)
at 
org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:107)
at 
org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:93)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Created] (TIKA-2145) InvalidFormatException on a valid Word file

2016-10-26 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2145:


 Summary: InvalidFormatException on a valid Word file
 Key: TIKA-2145
 URL: https://issues.apache.org/jira/browse/TIKA-2145
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
 Attachments: safety_analysis_report_FINAL2.docx

On the attached Word file, which opens fine with Word, the Tika parser throws 
the following exception:

org.apache.tika.exception.TikaException: Error creating OOXML extractor
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: java.lang.IllegalArgumentException: Date for created could not be 
parsed: 2015-07-27
at 
org.apache.poi.openxml4j.opc.internal.PackagePropertiesPart.setCreatedProperty(PackagePropertiesPart.java:408)
at 
org.apache.poi.openxml4j.opc.internal.unmarshallers.PackagePropertiesUnmarshaller.unmarshall(PackagePropertiesUnmarshaller.java:124)
at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:743)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:230)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:69)
... 3 more
Caused by: org.apache.poi.openxml4j.exceptions.InvalidFormatException: Date 
2015-07-27 not well formatted, expected format in: yyyy-MM-dd'T'HH:mm:ssz, 
yyyy-MM-dd'T'HH:mm:ss.SSSz, yyyy-MM-dd'T'HH:mm:ss'Z', 
yyyy-MM-dd'T'HH:mm:ss.SS'Z'
at 
org.apache.poi.openxml4j.opc.internal.PackagePropertiesPart.setDateValue(PackagePropertiesPart.java:615)
at 
org.apache.poi.openxml4j.opc.internal.PackagePropertiesPart.setCreatedProperty(PackagePropertiesPart.java:406)
... 7 more
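
A minimal sketch of why the date-only value is rejected, using only java.text.SimpleDateFormat rather than POI itself. The four patterns are taken from the exception message, with the year field assumed to be yyyy (the archived text appears to have lost it). The class name CreatedDateCheck, the helper parsesWith, and the fallback pattern "yyyy-MM-dd" are illustrative assumptions, not part of POI or Tika.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

class CreatedDateCheck {
    // Patterns as listed in the exception message (year field assumed to be yyyy).
    static final String[] STRICT_PATTERNS = {
        "yyyy-MM-dd'T'HH:mm:ssz",
        "yyyy-MM-dd'T'HH:mm:ss.SSSz",
        "yyyy-MM-dd'T'HH:mm:ss'Z'",
        "yyyy-MM-dd'T'HH:mm:ss.SS'Z'"
    };

    // Returns true if the value parses with at least one of the given patterns.
    static boolean parsesWith(String value, String... patterns) {
        for (String pattern : patterns) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
            format.setTimeZone(TimeZone.getTimeZone("UTC"));
            format.setLenient(false);
            try {
                format.parse(value);
                return true;
            } catch (ParseException e) {
                // This pattern does not match; try the next one.
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String created = "2015-07-27"; // value reported in the stack trace
        System.out.println("strict patterns: " + parsesWith(created, STRICT_PATTERNS));
        System.out.println("date-only pattern: " + parsesWith(created, "yyyy-MM-dd"));
    }
}
```

Every listed pattern requires a literal 'T' and a time part, so a bare date cannot match any of them. A date-only fallback pattern would accept the value; whether that is the right fix for POI is a separate question.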





[jira] [Updated] (TIKA-2145) InvalidFormatException on a valid Word file

2016-10-26 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2145:
-
Attachment: safety_analysis_report_FINAL2.docx

> InvalidFormatException on a valid Word file
> ---
>
> Key: TIKA-2145
> URL: https://issues.apache.org/jira/browse/TIKA-2145
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: safety_analysis_report_FINAL2.docx
>
>
> On the attached Word file, which opens fine with Word, the Tika parser throws 
> the following exception:
> org.apache.tika.exception.TikaException: Error creating OOXML extractor
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> Caused by: java.lang.IllegalArgumentException: Date for created could not be 
> parsed: 2015-07-27
>   at 
> org.apache.poi.openxml4j.opc.internal.PackagePropertiesPart.setCreatedProperty(PackagePropertiesPart.java:408)
>   at 
> org.apache.poi.openxml4j.opc.internal.unmarshallers.PackagePropertiesUnmarshaller.unmarshall(PackagePropertiesUnmarshaller.java:124)
>   at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:743)
>   at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:230)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:69)
>   ... 3 more
> Caused by: org.apache.poi.openxml4j.exceptions.InvalidFormatException: Date 
> 2015-07-27 not well formatted, expected format in: yyyy-MM-dd'T'HH:mm:ssz, 
> yyyy-MM-dd'T'HH:mm:ss.SSSz, yyyy-MM-dd'T'HH:mm:ss'Z', 
> yyyy-MM-dd'T'HH:mm:ss.SS'Z'
>   at 
> org.apache.poi.openxml4j.opc.internal.PackagePropertiesPart.setDateValue(PackagePropertiesPart.java:615)
>   at 
> org.apache.poi.openxml4j.opc.internal.PackagePropertiesPart.setCreatedProperty(PackagePropertiesPart.java:406)
>   ... 7 more





[jira] [Updated] (TIKA-2144) NullPointerException on a valid Word file

2016-10-26 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2144:
-
Attachment: Proposal ID 17 Offeror ChromoLogic.docx

> NullPointerException on a valid Word file
> -
>
> Key: TIKA-2144
> URL: https://issues.apache.org/jira/browse/TIKA-2144
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Proposal ID 17 Offeror ChromoLogic.docx
>
>
> On the attached Word file, which opens fine in Word, the Tika parser throws 
> the following error:
> java.lang.NullPointerException
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:149)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:107)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:93)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Updated] (TIKA-2142) ArrayIndexOutOfBoundsException

2016-10-24 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2142:
-
Attachment: HPV8dHinge Confocal Results.ppt

> ArrayIndexOutOfBoundsException
> --
>
> Key: TIKA-2142
> URL: https://issues.apache.org/jira/browse/TIKA-2142
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: HPV8dHinge Confocal Results.ppt
>
>
> On the attached PowerPoint presentation, which opens fine with PowerPoint, 
> the Tika parser throws the following error:
> java.lang.ArrayIndexOutOfBoundsException
>   at java.lang.System.arraycopy(Native Method)
>   at 
> org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.readPictures(HSLFSlideShowImpl.java:438)
>   at 
> org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.getPictureData(HSLFSlideShowImpl.java:772)
>   at 
> org.apache.poi.hslf.usermodel.HSLFSlideShow.getPictureData(HSLFSlideShow.java:547)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:305)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)





[jira] [Updated] (TIKA-2141) MalformedByteSequenceException on a valid Excel file

2016-10-21 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2141:
-
Attachment: Freezer1.xlsx

> MalformedByteSequenceException on a valid Excel file
> 
>
> Key: TIKA-2141
> URL: https://issues.apache.org/jira/browse/TIKA-2141
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Freezer1.xlsx
>
>
> On the attached XLSX file, which opens fine in Excel, the Tika parser throws 
> the following error:
> com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: 
> Invalid byte 3 of 3-byte UTF-8 sequence.
>   at 
> com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(Unknown 
> Source)
>   at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(Unknown 
> Source)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(Unknown Source)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.scanLiteral(Unknown 
> Source)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLScanner.scanAttributeValue(Unknown 
> Source)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanAttribute(Unknown
>  Source)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown
>  Source)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown
>  Source)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown 
> Source)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(Unknown 
> Source)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown
>  Source)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown 
> Source)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown 
> Source)
>   at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown 
> Source)
>   at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown 
> Source)
>   at 
> com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown 
> Source)
>   at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
>   at 
> org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:137)
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:116)
>   at 
> org.openxmlformats.schemas.drawingml.x2006.main.ThemeDocument$Factory.parse(Unknown
>  Source)
>   at org.apache.poi.xssf.model.ThemesTable.<init>(ThemesTable.java:85)
>   at org.apache.poi.xssf.model.ThemesTable.<init>(ThemesTable.java:96)
>   at 
> org.apache.poi.xssf.eventusermodel.XSSFReader.getStylesTable(XSSFReader.java:111)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:114)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:97)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Created] (TIKA-2141) MalformedByteSequenceException on a valid Excel file

2016-10-21 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2141:


 Summary: MalformedByteSequenceException on a valid Excel file
 Key: TIKA-2141
 URL: https://issues.apache.org/jira/browse/TIKA-2141
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
 Attachments: Freezer1.xlsx

On the attached XLSX file, which opens fine in Excel, the Tika parser throws 
the following error:

com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: 
Invalid byte 3 of 3-byte UTF-8 sequence.
at 
com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(Unknown 
Source)
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(Unknown 
Source)
at 
com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(Unknown Source)
at 
com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.scanLiteral(Unknown 
Source)
at 
com.sun.org.apache.xerces.internal.impl.XMLScanner.scanAttributeValue(Unknown 
Source)
at 
com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanAttribute(Unknown
 Source)
at 
com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown
 Source)
at 
com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown
 Source)
at 
com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown 
Source)
at 
com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(Unknown 
Source)
at 
com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown
 Source)
at 
com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown 
Source)
at 
com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown 
Source)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown 
Source)
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown 
Source)
at 
com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown 
Source)
at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
at 
org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:137)
at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:116)
at 
org.openxmlformats.schemas.drawingml.x2006.main.ThemeDocument$Factory.parse(Unknown
 Source)
at org.apache.poi.xssf.model.ThemesTable.<init>(ThemesTable.java:85)
at org.apache.poi.xssf.model.ThemesTable.<init>(ThemesTable.java:96)
at 
org.apache.poi.xssf.eventusermodel.XSSFReader.getStylesTable(XSSFReader.java:111)
at 
org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:114)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
at 
org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:97)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
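
A minimal sketch of the kind of input that produces an "Invalid byte 3 of 3-byte UTF-8 sequence" error, using the JDK's strict CharsetDecoder rather than Xerces. The class name Utf8StrictCheck and the sample byte arrays are illustrative assumptions. The actual bytes in the attached file's theme part have not been inspected here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8StrictCheck {
    // Returns true only if the bytes decode as UTF-8 with no malformed sequences.
    static boolean isWellFormedUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // E2 82 AC is the complete 3-byte encoding of the euro sign.
        byte[] complete = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC};
        // Same lead bytes, but the third byte is an ASCII quote (0x22),
        // which is not a valid continuation byte.
        byte[] truncated = {(byte) 0xE2, (byte) 0x82, (byte) 0x22};
        System.out.println("complete: " + isWellFormedUtf8(complete));
        System.out.println("truncated: " + isWellFormedUtf8(truncated));
    }
}
```

A check like this, run over the XML parts of the package, could help locate the offending byte sequence before the file is handed to the parser.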





[jira] [Created] (TIKA-2140) ClassCastException on a valid PDF

2016-10-21 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2140:


 Summary: ClassCastException on a valid PDF
 Key: TIKA-2140
 URL: https://issues.apache.org/jira/browse/TIKA-2140
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


On the following PDF file, which opens fine in Adobe Reader:

https://dl.dropboxusercontent.com/u/92341073/FDA%20Submission%2096%20Vol.%20III.pdf

the Tika parser throws the following error:

java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast 
to org.apache.pdfbox.cos.COSDictionary
at org.apache.pdfbox.pdmodel.PDPageTree.getKids(PDPageTree.java:144)
at org.apache.pdfbox.pdmodel.PDPageTree.access$200(PDPageTree.java:38)
at 
org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:166)
at 
org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.<init>(PDPageTree.java:159)
at 
org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.<init>(PDPageTree.java:153)
at org.apache.pdfbox.pdmodel.PDPageTree.iterator(PDPageTree.java:123)
at 
org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:314)
at 
org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:160)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:144)

Before that, PDFBox throws some warnings:

21 Oct 2016 11:46:35  WARN BaseParser - Invalid dictionary, found: '?' but 
expected: '/' at offset 22061056
21 Oct 2016 11:46:36  WARN BaseParser - Invalid dictionary, found: '?' but 
expected: '/' at offset 22061056
21 Oct 2016 11:46:36  WARN COSParser - Object (3:0) at offset 22059324 does not 
end with 'endobj' but with ''

So the file is somewhat malformed, but not to the point of unreadability.




