[jira] [Commented] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15614367#comment-15614367 ]

Sharath Kumar commented on TIKA-2146:
-------------------------------------

[~talli...@mitre.org] I ran the same document that I attached through Tika 1.13, and I get the error below in 1.13 as well. I have one more protected MS Word 97 document (which I can't share because of the sensitive data in it) that also fails with an error. The error logs are below.

I have a question: does Tika support extracting the contents of a protected MS Word document? The document in question is not password protected, though.

Output 1:

C:\Users\sk\Downloads>java -jar tika-app-1.13.jar Testbug.doc
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.Offic
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
Caused by: java.lang.IllegalStateException: Told we're for characters 8236 -> 10293, but actually covers 2055 characters!
    at org.apache.poi.hwpf.model.TextPiece.<init>(TextPiece.java:73)
    at org.apache.poi.hwpf.model.TextPieceTable.<init>(TextPieceTable.java:112)
    at org.apache.poi.hwpf.model.ComplexFileTable.<init>(ComplexFileTable.java:70)
    at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:72)
    at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:602)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:146)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 5 more

Output 2:

Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6f27a732
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
Caused by: java.lang.ArrayIndexOutOfBoundsException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:342)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 5 more

> Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
> ----------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2146
>                 URL: https://issues.apache.org/jira/browse/TIKA-2146
>             Project: Tika
>          Issue Type: Bug
>          Components: core, parser
>    Affects Versions: 1.11
>         Environment: Windows 7
>            Reporter: Sharath Kumar
>         Attachments: Test bug.doc, This is password protected.doc
>
>
> When I try to parse a MS word document which is protected, I am unable to extract the content rather, i get the below exception
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@29402a40
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.Tika.parseToString(Tika.java:537)
>     at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102)
>     at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99)
>     at org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482)
>     at
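Both outputs above share the same outer TikaException; what tells them apart is the innermost cause (IllegalStateException in the first, ArrayIndexOutOfBoundsException in the second). For code that embeds Tika rather than using the CLI, a minimal sketch of walking the cause chain might look like the following. It uses only the JDK; the class name RootCause and the stand-in wrapper exception are hypothetical, and a real caller would catch TikaException instead.

```java
class RootCause {
    /** Follows getCause() links until the innermost exception is reached. */
    static Throwable of(Throwable t) {
        Throwable current = t;
        // Bounded walk so that a malformed, cyclic cause chain cannot loop forever.
        for (int hops = 0; hops < 100; hops++) {
            Throwable next = current.getCause();
            if (next == null || next == current) {
                break;
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in for the wrapper seen in the logs (assumption: no Tika on the classpath here).
        Exception wrapper = new Exception(
                "Unexpected RuntimeException from OfficeParser",
                new ArrayIndexOutOfBoundsException());
        System.out.println(RootCause.of(wrapper).getClass().getName());
    }
}
```

Logging the root cause class alongside the file name makes it easier to group failing documents by the underlying POI error rather than by the generic wrapper message.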
[jira] [Commented] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15614327#comment-15614327 ]

Nick Burch commented on TIKA-2146:
----------------------------------

As per https://poi.apache.org/encryption.html, there's no support in Apache POI for reading password protected .doc files, only .docx ones. Sadly that means, unless someone volunteers to add the support to POI, that having the password won't actually help...

> Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
> ----------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2146
>                 URL: https://issues.apache.org/jira/browse/TIKA-2146
>             Project: Tika
>          Issue Type: Bug
>          Components: core, parser
>    Affects Versions: 1.11
>         Environment: Windows 7
>            Reporter: Sharath Kumar
>         Attachments: Test bug.doc, This is password protected.doc
>
>
> When I try to parse a MS word document which is protected, I am unable to extract the content rather, i get the below exception
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@29402a40
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.Tika.parseToString(Tika.java:537)
>     at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102)
>     at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99)
>     at org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482)
>     at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309)
>     at org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436)
>     at org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262)
>     at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122)
>     at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309)
>     at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529)
>     at org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506)
>     at org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215)
>     at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224)
>     at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
>     at org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389)
>     at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191)
>     at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
>     at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
>     at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>     at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
>     at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
>     at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
>     at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
>     at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>     at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84)
>     at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:345)
>     at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>     at
[jira] [Commented] (TIKA-2148) Tika app is unable to parse a password protected PowerPoint (97-2003) document
[ https://issues.apache.org/jira/browse/TIKA-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613123#comment-15613123 ]

Frank Refol commented on TIKA-2148:
-----------------------------------

Relating TIKA-1761 because it sounds like the same issue. However, the reporter there states that the problem does not occur when the document is created using Office 2007, which does not match my experience.

> Tika app is unable to parse a password protected PowerPoint (97-2003) document
> ------------------------------------------------------------------------------
>
>                 Key: TIKA-2148
>                 URL: https://issues.apache.org/jira/browse/TIKA-2148
>             Project: Tika
>          Issue Type: Bug
>          Components: cli
>    Affects Versions: 1.13
>         Environment: Windows console.
>            Reporter: Frank Refol
>              Labels: Office, PowerPoint
>         Attachments: This is password protected (Created with MS 2003).ppt, This is password protected (Created with MS 2007).ppt, This is password protected (Created with MS 2010).ppt
>
>
> Using the Tika command-line application to extract text from a PowerPoint 97-2003 document fails. Here's the basic command that was used:
> {quote}
> java -jar tika-app-1.13.jar -t --password=password "This is password protected (Created with MS 2003).ppt"
> {quote}
> The following exception is thrown on the console:
> {noformat}
> Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@62204612
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191)
>     at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480)
>     at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
> Caused by: org.apache.poi.hslf.exceptions.EncryptedPowerPointFileException: PowerPoint file is encrypted. The correct password needs to be set via Biff8EncryptionKey.setCurrentUserPassword()
>     at org.apache.poi.hslf.usermodel.HSLFSlideShowEncrypted.<init>(HSLFSlideShowEncrypted.java:106)
>     at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read(HSLFSlideShowImpl.java:284)
>     at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords(HSLFSlideShowImpl.java:275)
>     at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>(HSLFSlideShowImpl.java:179)
>     at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>(HSLFSlideShow.java:182)
>     at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:61)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     ... 5 more
> {noformat}
> Note that this happens with a PPT file that is created using Office 2010, Office 2007, or Office 2003.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
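When triaging console output like the trace quoted above, the line that matters is usually the last "Caused by:" entry, here EncryptedPowerPointFileException. The following is a minimal sketch, assuming the captured CLI output is available as a string; the class name CausedBy and the shortened sample trace are illustrative only and are not part of Tika or POI.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CausedBy {
    // Matches "Caused by: <fully.qualified.ClassName>" at the start of a line.
    private static final Pattern CAUSED_BY =
            Pattern.compile("(?m)^\\s*Caused by: ([\\w.$]+)");

    /** Returns the class named on the last "Caused by:" line, or null if there is none. */
    static String lastCause(String trace) {
        Matcher m = CAUSED_BY.matcher(trace);
        String last = null;
        while (m.find()) {
            last = m.group(1);
        }
        return last;
    }

    public static void main(String[] args) {
        // Shortened copy of the trace from the issue description.
        String trace = String.join("\n",
                "Exception in thread \"main\" org.apache.tika.exception.TikaException: Unexpected RuntimeException",
                "\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)",
                "Caused by: org.apache.poi.hslf.exceptions.EncryptedPowerPointFileException: PowerPoint file is encrypted.",
                "\tat org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:61)");
        System.out.println(lastCause(trace));
    }
}
```

Grouping failing files by this class name separates encryption failures such as this one from structural parse errors such as the ArrayIndexOutOfBoundsException reported in TIKA-2146.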
[jira] [Updated] (TIKA-2148) Tika app is unable to parse a password protected PowerPoint (97-2003) document
[ https://issues.apache.org/jira/browse/TIKA-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Refol updated TIKA-2148: -- Attachment: This is password protected (Created with MS 2010).ppt This is password protected (Created with MS 2007).ppt This is password protected (Created with MS 2003).ppt
[jira] [Created] (TIKA-2148) Tika app is unable to parse a password protected PowerPoint (97-2003) document
Frank Refol created TIKA-2148:
---------------------------------

             Summary: Tika app is unable to parse a password protected PowerPoint (97-2003) document
                 Key: TIKA-2148
                 URL: https://issues.apache.org/jira/browse/TIKA-2148
             Project: Tika
          Issue Type: Bug
          Components: cli
    Affects Versions: 1.13
         Environment: Windows console.
            Reporter: Frank Refol


Using the Tika command-line application to extract text from a PowerPoint 97-2003 document fails. Here's the basic command that was used:
{quote}
java -jar tika-app-1.13.jar -t --password=password "This is password protected (Created with MS 2003).ppt"
{quote}
The following exception is thrown on the console:
{noformat}
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@62204612
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
Caused by: org.apache.poi.hslf.exceptions.EncryptedPowerPointFileException: PowerPoint file is encrypted. The correct password needs to be set via Biff8EncryptionKey.setCurrentUserPassword()
    at org.apache.poi.hslf.usermodel.HSLFSlideShowEncrypted.<init>(HSLFSlideShowEncrypted.java:106)
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read(HSLFSlideShowImpl.java:284)
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords(HSLFSlideShowImpl.java:275)
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>(HSLFSlideShowImpl.java:179)
    at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>(HSLFSlideShow.java:182)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:61)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 5 more
{noformat}
Note that this happens with a PPT file that is created using Office 2010, Office 2007, or Office 2003.
[jira] [Created] (TIKA-2147) ClassCastException on a valid Word template
Seva Alekseyev created TIKA-2147:
------------------------------------

             Summary: ClassCastException on a valid Word template
                 Key: TIKA-2147
                 URL: https://issues.apache.org/jira/browse/TIKA-2147
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.13
         Environment: Windows 7 x64, JVM 1.8.0_101
            Reporter: Seva Alekseyev
         Attachments: Forefront Fax.dotx


On the attached document template, which opens fine in Word, the Tika parser throws the following error:

java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument
    at org.apache.poi.xwpf.usermodel.XWPFFootnotes.getXWPFDocument(XWPFFootnotes.java:162)
    at org.apache.poi.xwpf.usermodel.XWPFFootnote.<init>(XWPFFootnote.java:47)
    at org.apache.poi.xwpf.usermodel.XWPFFootnotes.onDocumentRead(XWPFFootnotes.java:95)
    at org.apache.poi.POIXMLDocumentPart._invokeOnDocumentRead(POIXMLDocumentPart.java:658)
    at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:235)
    at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
    at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
    at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
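The trace shows the failure is a downcast inside XWPFFootnotes.getXWPFDocument: the parent part is a plain POIXMLDocumentPart rather than an XWPFDocument. Why the parent has that type for this template is not established in the report. The sketch below only illustrates the failure mode and an instanceof guard, using hypothetical stand-in classes (Part, Document); it is not POI's code and not a proposed patch.

```java
class CastGuard {
    // Hypothetical stand-ins for POIXMLDocumentPart and XWPFDocument; not the POI classes.
    static class Part {}
    static class Document extends Part {}

    /** Returns the parent as a Document, or null when it is some other kind of part. */
    static Document asDocument(Part parent) {
        return (parent instanceof Document) ? (Document) parent : null;
    }

    public static void main(String[] args) {
        Part plain = new Part();
        try {
            // Compiles because Document is a subclass of Part, but fails at run time.
            Document d = (Document) plain;
            System.out.println(d);
        } catch (ClassCastException e) {
            System.out.println("cast failed");
        }
        // The guarded version reports the mismatch without throwing.
        System.out.println(asDocument(plain) == null);
    }
}
```

A guard like this turns the runtime exception into a condition the caller can handle, though whether returning null is appropriate depends on what the surrounding code expects.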
[jira] [Updated] (TIKA-2147) ClassCastException on a valid Word template
[ https://issues.apache.org/jira/browse/TIKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seva Alekseyev updated TIKA-2147: - Attachment: Forefront Fax.dotx
[jira] [Comment Edited] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613023#comment-15613023 ] Frank Refol edited comment on TIKA-2146 at 10/27/16 7:58 PM: - I just ran into this issue as well. I am testing unprotecting MS-WORD docs from command-line using the Tika app 1.13. I ran into the problem trying to open a Word 97-2003 document: {quote} java -jar tika-app-1.13.jar -t --password=password "This is password protected.doc" {quote} I am attaching the sample doc that I am using for testing. The password is simply, password. BTW, there is no problem parsing a non-password protected document. Also, FYI, the test file was created using MS Office 2010 by using the Save As Word 97-2003 document option. was (Author: t3knoid): I just ran into this issue as well. I am testing unprotecting MS-WORD docs from command-line using the Tika app 1.13. I ran into the problem trying to open a Word 97-2003 document: {quote} java -jar tika-app-1.13.jar -t --password=password "This is password protected.doc" {quote} I am attaching the sample doc that I am using for testing. The password is simply, password. BTW, there is no problem parsing a non-password protected document.
[jira] [Commented] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613023#comment-15613023 ] Frank Refol commented on TIKA-2146: --- I just ran into this issue as well. I am testing unprotecting MS-WORD docs from command-line using the Tika app 1.13. I ran into the problem trying to open a Word 97-2003 document: {quote} java -jar tika-app-1.13.jar -t --password=password "This is password protected.doc" {quote} I am attaching the sample doc that I am using for testing. The password is simply, password. BTW, there is no problem parsing a non-password protected document.
[jira] [Updated] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank Refol updated TIKA-2146:
------------------------------
    Attachment: This is password protected.doc

> Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
> ----------------------------------------------------------------------------------------------
>
>     Key: TIKA-2146
>     URL: https://issues.apache.org/jira/browse/TIKA-2146
>     Project: Tika
>     Issue Type: Bug
>     Components: core, parser
>     Affects Versions: 1.11
>     Environment: Windows 7
>     Reporter: Sharath Kumar
>     Attachments: Test bug.doc, This is password protected.doc
>
> When I try to parse a protected MS Word document, I am unable to extract the content; instead, I get the exception below:
>
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@29402a40
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.Tika.parseToString(Tika.java:537)
>     at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102)
>     at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99)
>     at org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482)
>     at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309)
>     at org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436)
>     at org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262)
>     at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122)
>     at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309)
>     at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529)
>     at org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506)
>     at org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215)
>     at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224)
>     at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
>     at org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389)
>     at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191)
>     at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
>     at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
>     at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>     at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
>     at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
>     at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
>     at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
>     at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>     at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84)
>     at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:345)
>     at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (TIKA-2144) NullPointerException on a valid Word file
[ https://issues.apache.org/jira/browse/TIKA-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15611753#comment-15611753 ]

Seva Alekseyev commented on TIKA-2144:
--------------------------------------

No idea. I was given a huge library of documents (Office and PDF) and told to implement full text search. I might or might not be able to track down the author, but that's irrelevant. If it opens in Word, it's a valid document, ergo it should be in my index.

> NullPointerException on a valid Word file
> -----------------------------------------
>
>     Key: TIKA-2144
>     URL: https://issues.apache.org/jira/browse/TIKA-2144
>     Project: Tika
>     Issue Type: Bug
>     Components: parser
>     Affects Versions: 1.13
>     Environment: Windows 7 x64, JVM 1.8.0_101
>     Reporter: Seva Alekseyev
>     Attachments: Proposal ID 17 Offeror ChromoLogic.docx
>
> On the attached Word file, which opens fine in Word, the Tika parser throws the following error:
>
> java.lang.NullPointerException
>     at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:149)
>     at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:107)
>     at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:93)
>     at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
[jira] [Comment Edited] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15611671#comment-15611671 ]

Sharath Kumar edited comment on TIKA-2146 at 10/27/16 12:36 PM:
----------------------------------------------------------------

Sure. I have uploaded the doc. The file is not password protected. I also see errors like the one below for these types of docs (protected Word docs):

java.security.PrivilegedActionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@29402a40
    at java.security.AccessController.doPrivileged(Native Method)

was (Author: mnsk07):
Sure. I have uploaded the doc. The file is not password protected. I also see errors like the one below for these types of docs:

java.security.PrivilegedActionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@29402a40
    at java.security.AccessController.doPrivileged(Native Method)
[jira] [Updated] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sharath Kumar updated TIKA-2146:
--------------------------------
    Attachment: Test bug.doc
[jira] [Commented] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15611671#comment-15611671 ]

Sharath Kumar commented on TIKA-2146:
-------------------------------------

Sure. I have uploaded the doc. The file is not password protected. I also see errors like the one below for these types of docs:

java.security.PrivilegedActionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@29402a40
    at java.security.AccessController.doPrivileged(Native Method)
[jira] [Commented] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15611639#comment-15611639 ]

Tim Allison commented on TIKA-2146:
-----------------------------------

Are you able to share the document? Do you have the password for the document?
[jira] [Updated] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sharath Kumar updated TIKA-2146:
--------------------------------
    Component/s: parser
[jira] [Created] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
Sharath Kumar created TIKA-2146:
--------------------------------

    Summary: Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
    Key: TIKA-2146
    URL: https://issues.apache.org/jira/browse/TIKA-2146
    Project: Tika
    Issue Type: Bug
    Components: core
    Affects Versions: 1.11
    Environment: Windows 7
    Reporter: Sharath Kumar

When I try to parse a protected MS Word document, I am unable to extract the content; instead, I get the exception below:

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@29402a40
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.Tika.parseToString(Tika.java:537)
    at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102)
    at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1)
    at java.security.AccessController.doPrivileged(Native Method)
    at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99)
    at org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482)
    at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309)
    at org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436)
    at org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262)
    at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309)
    at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529)
    at org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506)
    at org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215)
    at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException
    at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:345)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
[jira] [Commented] (TIKA-2144) NullPointerException on a valid Word file
[ https://issues.apache.org/jira/browse/TIKA-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15610867#comment-15610867 ]

Nick Burch commented on TIKA-2144:
----------------------------------

Do you know how the file in question was generated? It seems to have paragraphs with styles applied, but no style definitions at all, which seems a bit odd...