[jira] [Commented] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException

2016-10-27 Thread Sharath Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15614367#comment-15614367
 ] 

Sharath Kumar commented on TIKA-2146:
-

[~talli...@mitre.org]

I ran the same document that I attached using Tika 1.13, and I get the issue 
below in 1.13 as well. I have one more protected MS Word 97 document (which I 
can't share because it contains sensitive data) that also returns an error. 
Below are the error logs. I have a question: does Tika support extracting the 
contents of a protected MS Word document? The document in question is not 
password protected, though.

Output 1:
C:\Users\sk\Downloads>java -jar tika-app-1.13.jar Testbug.doc
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.Offic
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
Caused by: java.lang.IllegalStateException: Told we're for characters 8236 -> 10293, but actually covers 2055 characters!
at org.apache.poi.hwpf.model.TextPiece.<init>(TextPiece.java:73)
at org.apache.poi.hwpf.model.TextPieceTable.<init>(TextPieceTable.java:112)
at org.apache.poi.hwpf.model.ComplexFileTable.<init>(ComplexFileTable.java:70)
at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:72)
at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:602)
at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:146)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 5 more


Output 2:

Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6f27a732
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
Caused by: java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84)
at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:342)
at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 5 more
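
Both outputs above share the same outer TikaException and differ only in the 
"Caused by:" section, so when triaging failures like these it may help to log 
the innermost cause rather than the wrapper. The following is a minimal sketch 
using only the JDK; the class name CauseChain and the stand-in wrapper 
exception in main are illustrative assumptions, not part of Tika or POI.

```java
import java.util.ArrayList;
import java.util.List;

public class CauseChain {
    /** Returns the exceptions from the outermost wrapper down to the root cause. */
    static List<Throwable> chain(Throwable t) {
        List<Throwable> out = new ArrayList<>();
        // Stop on null, and also if a cause chain ever loops back on itself.
        while (t != null && !out.contains(t)) {
            out.add(t);
            t = t.getCause();
        }
        return out;
    }

    /** Returns the innermost cause; the argument must not be null. */
    static Throwable rootCause(Throwable t) {
        List<Throwable> c = chain(t);
        return c.get(c.size() - 1);
    }

    public static void main(String[] args) {
        // Stand-in for the wrapper seen in the logs (assumption: a plain
        // Exception is used here so the sketch runs without Tika on the classpath).
        Exception wrapped = new Exception(
                "Unexpected RuntimeException from OfficeParser",
                new ArrayIndexOutOfBoundsException());
        System.out.println(rootCause(wrapped).getClass().getName());
    }
}
```

With real Tika code, the same helper could be called on the caught 
TikaException to separate the IllegalStateException case from the 
ArrayIndexOutOfBoundsException case.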



> Unable to extract contents from protected MS 
> word-doc-java.lang.ArrayIndexOutOfBoundsException
> --
>
> Key: TIKA-2146
> URL: https://issues.apache.org/jira/browse/TIKA-2146
> Project: Tika
>  Issue Type: Bug
>  Components: core, parser
>Affects Versions: 1.11
> Environment: Windows 7
>Reporter: Sharath Kumar
> Attachments: Test bug.doc, This is password protected.doc
>
>
> When I try to parse an MS Word document which is protected, I am unable to 
> extract the content; instead, I get the exception below:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.OfficeParser@29402a40
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.Tika.parseToString(Tika.java:537)
>   at 
> org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102)
>   at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99)
>   at 
> org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482)
>   at 
> 

[jira] [Commented] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException

2016-10-27 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15614327#comment-15614327
 ] 

Nick Burch commented on TIKA-2146:
--

As per https://poi.apache.org/encryption.html, there's no support in Apache POI 
for reading password protected .doc files, only .docx ones. Sadly that means, 
unless someone volunteers to add the support to POI, that having the password 
won't actually help...
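
Since support differs between the legacy binary formats and the OOXML ones, a 
quick look at the first bytes of a file can show which container it uses. The 
sketch below is an illustration only: the class and method names are 
hypothetical, and it checks just two well-known signatures (the OLE2 compound 
file header and the ZIP local file header). One caveat: password-encrypted 
OOXML files are themselves wrapped in an OLE2 container, so an OLE2 signature 
alone does not prove the file is a legacy .doc.

```java
import java.util.Arrays;

public class OfficeContainerSniffer {
    // OLE2 compound file signature, used by legacy .doc/.ppt/.xls files.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    // ZIP local file header signature, used by unencrypted OOXML files.
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};

    /** Classifies a file by its leading bytes as "OLE2", "ZIP" or "UNKNOWN". */
    static String sniff(byte[] head) {
        if (startsWith(head, OLE2)) return "OLE2";
        if (startsWith(head, ZIP)) return "ZIP";
        return "UNKNOWN";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) return false;
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        // In-memory stand-in for the first bytes of a .docx file.
        byte[] zipHeader = {'P', 'K', 0x03, 0x04, 0x14, 0x00};
        System.out.println(sniff(zipHeader));
    }
}
```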

> Unable to extract contents from protected MS 
> word-doc-java.lang.ArrayIndexOutOfBoundsException
> --
>
> Key: TIKA-2146
> URL: https://issues.apache.org/jira/browse/TIKA-2146
> Project: Tika
>  Issue Type: Bug
>  Components: core, parser
>Affects Versions: 1.11
> Environment: Windows 7
>Reporter: Sharath Kumar
> Attachments: Test bug.doc, This is password protected.doc
>
>
> When I try to parse an MS Word document which is protected, I am unable to 
> extract the content; instead, I get the exception below:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.OfficeParser@29402a40
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.Tika.parseToString(Tika.java:537)
>   at 
> org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102)
>   at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99)
>   at 
> org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122)
>   at 
> org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309)
>   at 
> org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529)
>   at 
> org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506)
>   at 
> org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215)
>   at 
> org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
>   at 
> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
>   at 
> org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
>   at 
> org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
>   at 
> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>   at org.apache.poi.hwpf.model.SectionTable.(SectionTable.java:84)
>   at org.apache.poi.hwpf.HWPFDocument.(HWPFDocument.java:345)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>   at 
> 

[jira] [Commented] (TIKA-2148) Tika app is unable to parse a password protected PowerPoint (97-2003) document

2016-10-27 Thread Frank Refol (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613123#comment-15613123
 ] 

Frank Refol commented on TIKA-2148:
---

Relating TIKA-1761 because it sounds like the same issue. However, the reporter 
there states that the problem does not occur when the document is created using 
Office 2007, which does not match my experience.

> Tika app is unable to parse a password protected PowerPoint (97-2003) 
> document 
> ---
>
> Key: TIKA-2148
> URL: https://issues.apache.org/jira/browse/TIKA-2148
> Project: Tika
>  Issue Type: Bug
>  Components: cli
>Affects Versions: 1.13
> Environment: Windows console.
>Reporter: Frank Refol
>  Labels: Office, PowerPoint
> Attachments: This is password protected (Created with MS 2003).ppt, 
> This is password protected (Created with MS 2007).ppt, This is password 
> protected (Created with MS 2010).ppt
>
>
> Using the Tika command-line application to extract text from a PowerPoint 
> 97-2003 document fails. Here's the basic command that was used:
> {quote}
> java -jar tika-app-1.13.jar -t --password=password "This is password 
> protected (Created with MS 2003).ppt"
> {quote}
> The following exception is thrown on the console:
> {noformat}
> Exception in thread "main" org.apache.tika.exception.TikaException: 
> Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.OfficeParser@62204612
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191)
>   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480)
>   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
> Caused by: org.apache.poi.hslf.exceptions.EncryptedPowerPointFileException: 
> PowerPoint file is encrypted. The correct password needs to be set via 
> Biff8EncryptionKey.setCurrentUserPassword()
>   at 
> org.apache.poi.hslf.usermodel.HSLFSlideShowEncrypted.(HSLFSlideShowEncrypted.java:106)
>   at 
> org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read(HSLFSlideShowImpl.java:284)
>   at 
> org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords(HSLFSlideShowImpl.java:275)
>   at 
> org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.(HSLFSlideShowImpl.java:179)
>   at 
> org.apache.poi.hslf.usermodel.HSLFSlideShow.(HSLFSlideShow.java:182)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:61)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 5 more
> {noformat}
> Note that this happens with a PPT file that is created using Office 2010, 
> Office 2007, or Office 2003.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-2148) Tika app is unable to parse a password protected PowerPoint (97-2003) document

2016-10-27 Thread Frank Refol (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Refol updated TIKA-2148:
--
Attachment: This is password protected (Created with MS 2010).ppt
This is password protected (Created with MS 2007).ppt
This is password protected (Created with MS 2003).ppt

> Tika app is unable to parse a password protected PowerPoint (97-2003) 
> document 
> ---
>
> Key: TIKA-2148
> URL: https://issues.apache.org/jira/browse/TIKA-2148
> Project: Tika
>  Issue Type: Bug
>  Components: cli
>Affects Versions: 1.13
> Environment: Windows console.
>Reporter: Frank Refol
>  Labels: Office, PowerPoint
> Attachments: This is password protected (Created with MS 2003).ppt, 
> This is password protected (Created with MS 2007).ppt, This is password 
> protected (Created with MS 2010).ppt
>
>
> Using the Tika command-line application to extract text from a PowerPoint 
> 97-2003 document fails. Here's the basic command that was used:
> {quote}
> java -jar tika-app-1.13.jar -t --password=password "This is password 
> protected (Created with MS 2003).ppt"
> {quote}
> The following exception is thrown on the console:
> {noformat}
> Exception in thread "main" org.apache.tika.exception.TikaException: 
> Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.OfficeParser@62204612
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191)
>   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480)
>   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
> Caused by: org.apache.poi.hslf.exceptions.EncryptedPowerPointFileException: 
> PowerPoint file is encrypted. The correct password needs to be set via 
> Biff8EncryptionKey.setCurrentUserPassword()
>   at 
> org.apache.poi.hslf.usermodel.HSLFSlideShowEncrypted.(HSLFSlideShowEncrypted.java:106)
>   at 
> org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read(HSLFSlideShowImpl.java:284)
>   at 
> org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords(HSLFSlideShowImpl.java:275)
>   at 
> org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.(HSLFSlideShowImpl.java:179)
>   at 
> org.apache.poi.hslf.usermodel.HSLFSlideShow.(HSLFSlideShow.java:182)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:61)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 5 more
> {noformat}
> Note that this happens with a PPT file that is created using Office 2010, 
> Office 2007, or Office 2003.





[jira] [Created] (TIKA-2148) Tika app is unable to parse a password protected PowerPoint (97-2003) document

2016-10-27 Thread Frank Refol (JIRA)
Frank Refol created TIKA-2148:
-

 Summary: Tika app is unable to parse a password protected 
PowerPoint (97-2003) document 
 Key: TIKA-2148
 URL: https://issues.apache.org/jira/browse/TIKA-2148
 Project: Tika
  Issue Type: Bug
  Components: cli
Affects Versions: 1.13
 Environment: Windows console.
Reporter: Frank Refol


Using the Tika command-line application to extract text from a PowerPoint 
97-2003 document fails. Here's the basic command that was used:
{quote}
java -jar tika-app-1.13.jar -t --password=password "This is password protected 
(Created with MS 2003).ppt"
{quote}

The following exception is thrown on the console:
{noformat}
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@62204612
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
Caused by: org.apache.poi.hslf.exceptions.EncryptedPowerPointFileException: PowerPoint file is encrypted. The correct password needs to be set via Biff8EncryptionKey.setCurrentUserPassword()
at org.apache.poi.hslf.usermodel.HSLFSlideShowEncrypted.<init>(HSLFSlideShowEncrypted.java:106)
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read(HSLFSlideShowImpl.java:284)
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords(HSLFSlideShowImpl.java:275)
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>(HSLFSlideShowImpl.java:179)
at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>(HSLFSlideShow.java:182)
at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:61)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 5 more
{noformat}

Note that this happens with a PPT file that is created using Office 2010, 
Office 2007, or Office 2003.





[jira] [Created] (TIKA-2147) ClassCastException on a valid Word template

2016-10-27 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2147:


 Summary: ClassCastException on a valid Word template
 Key: TIKA-2147
 URL: https://issues.apache.org/jira/browse/TIKA-2147
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
 Attachments: Forefront Fax.dotx

On the attached document template, which opens fine in Word, the Tika parser 
throws the following error:

java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument
at org.apache.poi.xwpf.usermodel.XWPFFootnotes.getXWPFDocument(XWPFFootnotes.java:162)
at org.apache.poi.xwpf.usermodel.XWPFFootnote.<init>(XWPFFootnote.java:47)
at org.apache.poi.xwpf.usermodel.XWPFFootnotes.onDocumentRead(XWPFFootnotes.java:95)
at org.apache.poi.POIXMLDocumentPart._invokeOnDocumentRead(POIXMLDocumentPart.java:658)
at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:235)
at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Updated] (TIKA-2147) ClassCastException on a valid Word template

2016-10-27 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2147:
-
Attachment: Forefront Fax.dotx

> ClassCastException on a valid Word template
> ---
>
> Key: TIKA-2147
> URL: https://issues.apache.org/jira/browse/TIKA-2147
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Forefront Fax.dotx
>
>
> On the attached document template, which opens fine in Word, the Tika parser 
> throws the following error:
> java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart cannot be 
> cast to org.apache.poi.xwpf.usermodel.XWPFDocument
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFootnotes.getXWPFDocument(XWPFFootnotes.java:162)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFootnote.(XWPFFootnote.java:47)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFootnotes.onDocumentRead(XWPFFootnotes.java:95)
>   at 
> org.apache.poi.POIXMLDocumentPart._invokeOnDocumentRead(POIXMLDocumentPart.java:658)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:235)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.(XWPFDocument.java:124)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Comment Edited] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException

2016-10-27 Thread Frank Refol (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613023#comment-15613023
 ] 

Frank Refol edited comment on TIKA-2146 at 10/27/16 7:58 PM:
-

I just ran into this issue as well. I am testing unprotecting MS-WORD docs from 
command-line using the Tika app 1.13. I ran into the problem trying to open a 
Word 97-2003 document:

{quote}
java -jar tika-app-1.13.jar -t --password=password "This is password 
protected.doc"
{quote}

I am attaching the sample doc that I am using for testing. The password is 
simply "password".

BTW, there is no problem parsing a non-password protected document. Also, FYI, 
the test file was created using MS Office 2010 by using the Save As Word 
97-2003 document option.


was (Author: t3knoid):
I just ran into this issue as well. I am testing unprotecting MS-WORD docs from 
command-line using the Tika app 1.13. I ran into the problem trying to open a 
Word 97-2003 document:

{quote}
java -jar tika-app-1.13.jar -t --password=password "This is password 
protected.doc"
{quote}

I am attaching the sample doc that I am using for testing. The password is 
simply "password".

BTW, there is no problem parsing a non-password protected document.

> Unable to extract contents from protected MS 
> word-doc-java.lang.ArrayIndexOutOfBoundsException
> --
>
> Key: TIKA-2146
> URL: https://issues.apache.org/jira/browse/TIKA-2146
> Project: Tika
>  Issue Type: Bug
>  Components: core, parser
>Affects Versions: 1.11
> Environment: Windows 7
>Reporter: Sharath Kumar
> Attachments: Test bug.doc, This is password protected.doc
>
>
> When I try to parse an MS Word document which is protected, I am unable to 
> extract the content; instead, I get the exception below:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.OfficeParser@29402a40
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.Tika.parseToString(Tika.java:537)
>   at 
> org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102)
>   at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99)
>   at 
> org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122)
>   at 
> org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309)
>   at 
> org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529)
>   at 
> org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506)
>   at 
> org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215)
>   at 
> org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
>   at 
> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
>   at 
> org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
>   at 
> 

[jira] [Commented] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException

2016-10-27 Thread Frank Refol (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613023#comment-15613023
 ] 

Frank Refol commented on TIKA-2146:
---

I just ran into this issue as well. I am testing unprotecting MS-WORD docs from 
command-line using the Tika app 1.13. I ran into the problem trying to open a 
Word 97-2003 document:

{quote}
java -jar tika-app-1.13.jar -t --password=password "This is password 
protected.doc"
{quote}

I am attaching the sample doc that I am using for testing. The password is 
simply "password".

BTW, there is no problem parsing a non-password protected document.
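
For anyone scripting this test, the command above can be assembled as an 
argument list so that the file name with spaces needs no shell quoting. This is 
a minimal sketch, assuming the jar sits in the working directory; the class and 
method names are made up for illustration, and only the -t and --password= 
options shown in the command above are used.

```java
import java.util.Arrays;
import java.util.List;

public class TikaCliCommand {
    /** Builds the argument list for a text-extraction run of the Tika app. */
    static List<String> textCommand(String jar, String password, String file) {
        // Each element reaches the JVM as exactly one argument, spaces included.
        return Arrays.asList("java", "-jar", jar, "-t", "--password=" + password, file);
    }

    public static void main(String[] args) {
        List<String> cmd = textCommand(
                "tika-app-1.13.jar", "password", "This is password protected.doc");
        System.out.println(cmd);
        // To actually launch it: new ProcessBuilder(cmd).inheritIO().start();
    }
}
```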

> Unable to extract contents from protected MS 
> word-doc-java.lang.ArrayIndexOutOfBoundsException
> --
>
> Key: TIKA-2146
> URL: https://issues.apache.org/jira/browse/TIKA-2146
> Project: Tika
>  Issue Type: Bug
>  Components: core, parser
>Affects Versions: 1.11
> Environment: Windows 7
>Reporter: Sharath Kumar
> Attachments: Test bug.doc, This is password protected.doc
>
>
> When I try to parse an MS Word document which is protected, I am unable to 
> extract the content; instead, I get the exception below:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.OfficeParser@29402a40
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.Tika.parseToString(Tika.java:537)
>   at 
> org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102)
>   at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99)
>   at 
> org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122)
>   at 
> org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309)
>   at 
> org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529)
>   at 
> org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506)
>   at 
> org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215)
>   at 
> org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
>   at 
> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
>   at 
> org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
>   at 
> org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
>   at 
> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>   at org.apache.poi.hwpf.model.SectionTable.(SectionTable.java:84)
>   at org.apache.poi.hwpf.HWPFDocument.(HWPFDocument.java:345)
>   at 
> 

[jira] [Updated] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException

2016-10-27 Thread Frank Refol (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Refol updated TIKA-2146:
--
Attachment: This is password protected.doc

> Unable to extract contents from protected MS 
> word-doc-java.lang.ArrayIndexOutOfBoundsException
> --
>
> Key: TIKA-2146
> URL: https://issues.apache.org/jira/browse/TIKA-2146
> Project: Tika
>  Issue Type: Bug
>  Components: core, parser
>Affects Versions: 1.11
> Environment: Windows 7
>Reporter: Sharath Kumar
> Attachments: Test bug.doc, This is password protected.doc
>
>
> When I try to parse an MS Word document which is protected, I am unable to 
> extract the content; instead, I get the exception below:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.OfficeParser@29402a40
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.Tika.parseToString(Tika.java:537)
>   at 
> org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102)
>   at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99)
>   at 
> org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122)
>   at 
> org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309)
>   at 
> org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529)
>   at 
> org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506)
>   at 
> org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215)
>   at 
> org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
>   at 
> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
>   at 
> org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
>   at 
> org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
>   at 
> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>   at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84)
>   at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:345)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-2144) NullPointerException on a valid Word file

2016-10-27 Thread Seva Alekseyev (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15611753#comment-15611753
 ] 

Seva Alekseyev commented on TIKA-2144:
--

No idea. I was given a huge library of documents (Office and PDF) and told to 
implement full text search. I might or might not be able to track down the 
author, but that's irrelevant. If it opens in Word, it's a valid document, ergo 
it should be in my index.
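
For a bulk indexing job like the one described, one possible approach is to record per-document failures and keep going, so a single parser exception does not stop the batch. The following is only a minimal sketch: `TolerantIndexer`, `TextExtractor`, and `Result` are hypothetical names invented for illustration, and `TextExtractor` stands in for whatever extraction call is actually used (for example a Tika call); none of this is Tika API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a batch run that survives per-document extraction failures. */
final class TolerantIndexer {

    private TolerantIndexer() {}

    /** Hypothetical stand-in for a text extractor; not part of Tika. */
    interface TextExtractor {
        String extract(String documentName) throws Exception;
    }

    /** Extracted text by document name, plus the failures and their causes. */
    static final class Result {
        final Map<String, String> indexed = new LinkedHashMap<>();
        final Map<String, Exception> failed = new LinkedHashMap<>();
    }

    /** Tries every document; a failure is recorded instead of rethrown. */
    static Result indexAll(List<String> documentNames, TextExtractor extractor) {
        Result result = new Result();
        for (String name : documentNames) {
            try {
                result.indexed.put(name, extractor.extract(name));
            } catch (Exception e) {
                // Keep the exception so the file can be reported or retried later.
                result.failed.put(name, e);
            }
        }
        return result;
    }
}
```

The `failed` map gives a list of documents to attach to a bug report while the rest of the library is still indexed.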

> NullPointerException on a valid Word file
> -
>
> Key: TIKA-2144
> URL: https://issues.apache.org/jira/browse/TIKA-2144
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Proposal ID 17 Offeror ChromoLogic.docx
>
>
> On the attached Word file, which opens fine in Word, the Tika parser throws 
> the following error:
> java.lang.NullPointerException
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:149)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:107)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:93)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Comment Edited] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException

2016-10-27 Thread Sharath Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15611671#comment-15611671
 ] 

Sharath Kumar edited comment on TIKA-2146 at 10/27/16 12:36 PM:


Sure. I have uploaded the doc. The file is not password protected.
I also see errors like the one below for this type of doc (protected Word docs):

java.security.PrivilegedActionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.microsoft.OfficeParser@29402a40
at java.security.AccessController.doPrivileged(Native Method)


was (Author: mnsk07):
Sure. I have uploaded the doc. The file is not password protected.
I also see errors like the one below for this type of doc:

java.security.PrivilegedActionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.microsoft.OfficeParser@29402a40
at java.security.AccessController.doPrivileged(Native Method)

> Unable to extract contents from protected MS 
> word-doc-java.lang.ArrayIndexOutOfBoundsException
> --
>
> Key: TIKA-2146
> URL: https://issues.apache.org/jira/browse/TIKA-2146
> Project: Tika
>  Issue Type: Bug
>  Components: core, parser
>Affects Versions: 1.11
> Environment: Windows 7
>Reporter: Sharath Kumar
> Attachments: Test bug.doc
>
>
> When I try to parse an MS Word document that is protected, I am unable to
> extract the content; instead, I get the exception below:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.OfficeParser@29402a40
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.Tika.parseToString(Tika.java:537)
>   at 
> org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102)
>   at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99)
>   at 
> org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122)
>   at 
> org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309)
>   at 
> org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529)
>   at 
> org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506)
>   at 
> org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215)
>   at 
> org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
>   at 
> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
>   at 
> org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
>   at 
> org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
>   at 
> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> 

[jira] [Updated] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException

2016-10-27 Thread Sharath Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sharath Kumar updated TIKA-2146:

Attachment: Test bug.doc

> Unable to extract contents from protected MS 
> word-doc-java.lang.ArrayIndexOutOfBoundsException
> --
>
> Key: TIKA-2146
> URL: https://issues.apache.org/jira/browse/TIKA-2146
> Project: Tika
>  Issue Type: Bug
>  Components: core, parser
>Affects Versions: 1.11
> Environment: Windows 7
>Reporter: Sharath Kumar
> Attachments: Test bug.doc
>
>
> When I try to parse an MS Word document that is protected, I am unable to
> extract the content; instead, I get the exception below:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.OfficeParser@29402a40
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.Tika.parseToString(Tika.java:537)
>   at 
> org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102)
>   at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99)
>   at 
> org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122)
>   at 
> org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309)
>   at 
> org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529)
>   at 
> org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506)
>   at 
> org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215)
>   at 
> org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
>   at 
> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
>   at 
> org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
>   at 
> org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
>   at 
> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>   at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84)
>   at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:345)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)





[jira] [Commented] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException

2016-10-27 Thread Sharath Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15611671#comment-15611671
 ] 

Sharath Kumar commented on TIKA-2146:
-

Sure. I have uploaded the doc. The file is not password protected.
I also see errors like the one below for this type of doc:

java.security.PrivilegedActionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.microsoft.OfficeParser@29402a40
at java.security.AccessController.doPrivileged(Native Method)
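
The trace above shows only the outer wrappers (`PrivilegedActionException` around `TikaException`), while the underlying failure sits deeper in the cause chain. A minimal sketch of how a caller might walk that chain to log the innermost exception follows. `RootCause` is a hypothetical helper name, and a plain `Exception` stands in for `TikaException` so the example needs no Tika dependency.

```java
import java.security.PrivilegedActionException;

/** Walks a Throwable's cause chain down to the innermost exception. */
final class RootCause {

    private RootCause() {}

    /** Returns the deepest cause of t; the hop limit guards against cyclic chains. */
    static Throwable of(Throwable t) {
        Throwable current = t;
        int hops = 0;
        while (current.getCause() != null && current.getCause() != current && hops < 64) {
            current = current.getCause();
            hops++;
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-ins that mirror the nesting in the report, not the real Tika types.
        Exception poiFailure = new ArrayIndexOutOfBoundsException();
        Exception tikaLike =
            new Exception("Unexpected RuntimeException from OfficeParser", poiFailure);
        Exception outer = new PrivilegedActionException(tikaLike);

        // Logs the class name of the innermost exception.
        System.out.println(RootCause.of(outer).getClass().getName());
    }
}
```

Logging the innermost exception alongside the wrapper makes it easier to tell which parser component actually failed.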

> Unable to extract contents from protected MS 
> word-doc-java.lang.ArrayIndexOutOfBoundsException
> --
>
> Key: TIKA-2146
> URL: https://issues.apache.org/jira/browse/TIKA-2146
> Project: Tika
>  Issue Type: Bug
>  Components: core, parser
>Affects Versions: 1.11
> Environment: Windows 7
>Reporter: Sharath Kumar
> Attachments: Test bug.doc
>
>
> When I try to parse an MS Word document that is protected, I am unable to
> extract the content; instead, I get the exception below:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.OfficeParser@29402a40
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.Tika.parseToString(Tika.java:537)
>   at 
> org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102)
>   at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99)
>   at 
> org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122)
>   at 
> org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309)
>   at 
> org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529)
>   at 
> org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506)
>   at 
> org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215)
>   at 
> org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
>   at 
> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
>   at 
> org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
>   at 
> org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
>   at 
> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>   at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84)
>   at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:345)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
>   at 
> 

[jira] [Commented] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException

2016-10-27 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15611639#comment-15611639
 ] 

Tim Allison commented on TIKA-2146:
---

Are you able to share the document?

Do you have the password for the document?

> Unable to extract contents from protected MS 
> word-doc-java.lang.ArrayIndexOutOfBoundsException
> --
>
> Key: TIKA-2146
> URL: https://issues.apache.org/jira/browse/TIKA-2146
> Project: Tika
>  Issue Type: Bug
>  Components: core, parser
>Affects Versions: 1.11
> Environment: Windows 7
>Reporter: Sharath Kumar
>
> When I try to parse an MS Word document that is protected, I am unable to
> extract the content; instead, I get the exception below:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.OfficeParser@29402a40
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.Tika.parseToString(Tika.java:537)
>   at 
> org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102)
>   at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99)
>   at 
> org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122)
>   at 
> org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309)
>   at 
> org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529)
>   at 
> org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506)
>   at 
> org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215)
>   at 
> org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
>   at 
> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
>   at 
> org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
>   at 
> org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
>   at 
> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>   at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84)
>   at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:345)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)





[jira] [Updated] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException

2016-10-27 Thread Sharath Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sharath Kumar updated TIKA-2146:

Component/s: parser

> Unable to extract contents from protected MS 
> word-doc-java.lang.ArrayIndexOutOfBoundsException
> --
>
> Key: TIKA-2146
> URL: https://issues.apache.org/jira/browse/TIKA-2146
> Project: Tika
>  Issue Type: Bug
>  Components: core, parser
>Affects Versions: 1.11
> Environment: Windows 7
>Reporter: Sharath Kumar
>
> When I try to parse an MS Word document that is protected, I am unable to
> extract the content; instead, I get the exception below:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.OfficeParser@29402a40
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.Tika.parseToString(Tika.java:537)
>   at 
> org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102)
>   at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99)
>   at 
> org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122)
>   at 
> org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309)
>   at 
> org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529)
>   at 
> org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506)
>   at 
> org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215)
>   at 
> org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
>   at 
> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
>   at 
> org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
>   at 
> org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
>   at 
> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>   at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84)
>   at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:345)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)





[jira] [Created] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException

2016-10-27 Thread Sharath Kumar (JIRA)
Sharath Kumar created TIKA-2146:
---

 Summary: Unable to extract contents from protected MS 
word-doc-java.lang.ArrayIndexOutOfBoundsException
 Key: TIKA-2146
 URL: https://issues.apache.org/jira/browse/TIKA-2146
 Project: Tika
  Issue Type: Bug
  Components: core
Affects Versions: 1.11
 Environment: Windows 7
Reporter: Sharath Kumar


When I try to parse an MS Word document that is protected, I am unable to
extract the content; instead, I get the exception below:

org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.microsoft.OfficeParser@29402a40
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.Tika.parseToString(Tika.java:537)
at 
org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102)
at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1)
at java.security.AccessController.doPrivileged(Native Method)
at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99)
at org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482)
at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309)
at org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436)
at org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262)
at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309)
at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529)
at org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506)
at org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215)
at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException
at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84)
at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:345)
at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-2144) NullPointerException on a valid Word file

2016-10-27 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15610867#comment-15610867
 ] 

Nick Burch commented on TIKA-2144:
--

Do you know how the file in question was generated? It seems to have paragraphs 
with stylings applied, but no style definitions at all, which seems a bit 
odd...

> NullPointerException on a valid Word file
> -
>
> Key: TIKA-2144
> URL: https://issues.apache.org/jira/browse/TIKA-2144
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Proposal ID 17 Offeror ChromoLogic.docx
>
>
> On the attached Word file, which opens fine in Word, the Tika parser throws 
> the following error:
> java.lang.NullPointerException
>   at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:149)
>   at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:107)
>   at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:93)
>   at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)


