[GitHub] tika pull request #145: Tika5

2017-01-13 Thread ashutoshvsingh
GitHub user ashutoshvsingh opened a pull request:

https://github.com/apache/tika/pull/145

Tika5

change snapshot to 1.0.0

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lafaspot/tika tika5

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/145.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #145


commit 9b2caa749b7d0b60dd9a683563337dd2c5598c84
Author: kraman14 
Date:   2017-01-13T01:20:48Z

added travis files

commit 576527ec5d934d02f249cc467eebe8f952338e8d
Author: kraman14 
Date:   2017-01-13T01:27:53Z

changed .travis.yml

commit 9bf4405c56c946b7bf0a0fe5cc196e3ad2d2985a
Author: kraman14 
Date:   2017-01-13T01:33:34Z

changed .travis.yml, deploy.sh

commit b921c71a8d8242d4aa644e85c17d0e1c3b10ad63
Author: kraman14 
Date:   2017-01-13T17:15:41Z

Fixed groupId to com.github.lafaspot.tikaNoExternal

- Removed the org.apache parent association
- Added lafaspot organization SCM and distribution management, only to
tika-parent and tika-core.

commit 2fac44ac1dbd8a4f956ed27a1d858f94392d8df6
Author: ashutoshvsingh 
Date:   2017-01-14T00:54:03Z

testing

commit 9414a75aa6428bf21020a47cfb12f04a10306d92
Author: ashutoshvsingh 
Date:   2017-01-14T01:05:15Z

change snapshot to 1.0.0




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (TIKA-2239) Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser

2017-01-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822344#comment-15822344
 ] 

Tim Allison commented on TIKA-2239:
---

Will take a look next week.  Thank you for opening this.

The experimental SAX/DOCX parser catches this exception when loading the 
Numbering part...leading to no numbering...er, success...of some kind.

Not sure there's a quick fix if the beans aren't working, but I'll take a look.

> Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser
> ---
>
> Key: TIKA-2239
> URL: https://issues.apache.org/jira/browse/TIKA-2239
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
>Reporter: Jorge Spinsanti
> Attachments: tika2239.docx
>
>
> I got an exception when extracting text from a DOCX file, caused by a 
> SAXParseException in Apache POI. See the stack trace:
> {code}
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser@51a94303
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1114)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1050)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:199)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>   at org.eclipse.jetty.server.Server.handle(Server.java:462)
>   at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:281)
>   at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:232)
>   at 
> org.eclipse.jetty.io.AbstractConnection$1.run(AbstractConnection.java:505)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:607)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:536)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal 
> IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@51a94303
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 16 more
> Caused by: java.io.IOException: Unable to parse xml bean
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:118)
>   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.NumberingDocument$Factory.parse(Unknown
>  Source)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFNumbering.onDocumentRead(XWPFNumbering.java:87)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:204)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 22 more
> Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 37; 
> The encoding declaration is required in the text declaration.
>   at 
> org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown 
> Source)
>   at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
>   at org.apache.xerces.impl.XMLScanner.scanXMLDeclOrTextDecl(Unknown 
> Source)
>   at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanXMLDeclOrTextDecl(Unknown
>  Source)
>   at 
> org.apache.xerces.impl.XMLDocumentScannerImpl$XMLDeclDispatcher.dispatch(Unknown
>  Source)
>   at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
> Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>   at 

[jira] [Resolved] (TIKA-2232) Add JBIG2 image parsing support

2017-01-13 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2232.
---
Resolution: Fixed

Let me know if you'd like different behavior.

> Add JBIG2 image parsing support
> ---
>
> Key: TIKA-2232
> URL: https://issues.apache.org/jira/browse/TIKA-2232
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.14
> Environment: Any
>Reporter: Pascal Essiembre
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 2.0, 1.15
>
>
> If you are interested, I would like to add support for JBIG2 image files 
> (.jb2, or .jbig2).  I have encountered them in PDFs.
> I will make a pull-request shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-2232) Add JBIG2 image parsing support

2017-01-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822253#comment-15822253
 ] 

Tim Allison edited comment on TIKA-2232 at 1/13/17 7:51 PM:


Proposed change if jbig2 is not on the classpath:

PDFParser extractInlineImages adds:
{noformat}
X-TIKA:EXCEPTION:warn : org.apache.pdfbox.filter.MissingImageReaderException: 
Cannot read JBIG2 image: jbig2-imageio is not installed
at org.apache.pdfbox.filter.Filter.findImageReader(Filter.java:128)
at org.apache.pdfbox.filter.JBIG2Filter.decode(JBIG2Filter.java:54)
{noformat}
to the metadata of the PDF...

ImageParser checks for JBIG2 in {{try{ Class.forName } ... }} before adding 
jbig2 to {{SUPPORTED_TYPES}}.  If jbig2 is not on the cp, then the files are 
handled by the EmptyParser, as they used to be.
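The classpath check described above might look roughly like the following. This is only a sketch of the general optional-dependency pattern, not the actual ImageParser code: the marker class name, the `supportedTypes` helper, and the media type strings are assumptions made for illustration.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

public class OptionalJbig2Support {
    // Hypothetical marker class; the real jbig2-imageio class name may differ.
    public static final String JBIG2_READER_CLASS = "com.example.jbig2.JBIG2ImageReader";

    // Returns true if the named class can be located, without initializing it.
    public static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, OptionalJbig2Support.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Builds the supported-type set, adding JBIG2 only when the optional class is present.
    // The media type strings are illustrative assumptions.
    public static Set<String> supportedTypes(String optionalClass) {
        Set<String> types = new LinkedHashSet<>();
        types.add("image/png");
        types.add("image/gif");
        if (isOnClasspath(optionalClass)) {
            types.add("image/x-jbig2");
        }
        return Collections.unmodifiableSet(types);
    }

    public static void main(String[] args) {
        System.out.println(supportedTypes(JBIG2_READER_CLASS));
    }
}
```

With this shape, a file type that is left out of the supported set simply falls through to whatever parser handled it before.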


was (Author: talli...@mitre.org):
Proposed change if jbig2 is not on the classpath:

PDFParser extractInlineImages adds:
{noformat}
X-TIKA:EXCEPTION:warn : org.apache.pdfbox.filter.MissingImageReaderException: 
Cannot read JBIG2 image: jbig2-imageio is not installed
at org.apache.pdfbox.filter.Filter.findImageReader(Filter.java:128)
at org.apache.pdfbox.filter.JBIG2Filter.decode(JBIG2Filter.java:54)
{noformat}
to the metadata of the PDF...

ImageParser checks for JBIG2 in {{try{ Class.forName}} before adding jp2 to 
{{SUPPORTED_TYPES}}.  If jbig2 is not on the cp, then the files are handled by 
the EmptyParser, as they used to be.






[jira] [Commented] (TIKA-2232) Add JBIG2 image parsing support

2017-01-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822253#comment-15822253
 ] 

Tim Allison commented on TIKA-2232:
---

Proposed change if jbig2 is not on the classpath:

PDFParser extractInlineImages adds:
{noformat}
X-TIKA:EXCEPTION:warn : org.apache.pdfbox.filter.MissingImageReaderException: 
Cannot read JBIG2 image: jbig2-imageio is not installed
at org.apache.pdfbox.filter.Filter.findImageReader(Filter.java:128)
at org.apache.pdfbox.filter.JBIG2Filter.decode(JBIG2Filter.java:54)
{noformat}
to the metadata of the PDF...

ImageParser checks for JBIG2 in {{try{ Class.forName}} before adding jp2 to 
{{SUPPORTED_TYPES}}.  If jbig2 is not on the cp, then the files are handled by 
the EmptyParser, as they used to be.






[jira] [Updated] (TIKA-2240) MS Write File

2017-01-13 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2240:
--
Attachment: 746255.doc

> MS Write File
> -
>
> Key: TIKA-2240
> URL: https://issues.apache.org/jira/browse/TIKA-2240
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
> Attachments: 746255.doc
>
>
> We're currently identifying MS Write Files by suffix ".wri" in one place in 
> our mime defs, but we're also using MS Write File's magic {{0x31be}} to 
> identify the file as an MSWord (doc) file in a different definition.
> In govdocs1, there are a handful of .wri files with suffix .doc.  We're 
> getting an Invalid Header exception for these files.
> I think it would be better to move their magic out of our .doc definition to 
> the .wri definition and use the EmptyParser.
> Any objections?





[jira] [Created] (TIKA-2240) MS Write File

2017-01-13 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2240:
-

 Summary: MS Write File
 Key: TIKA-2240
 URL: https://issues.apache.org/jira/browse/TIKA-2240
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Trivial


We're currently identifying MS Write Files by suffix ".wri" in one place in our 
mime defs, but we're also using MS Write File's magic {{0x31be}} to 
identify the file as an MSWord (doc) file in a different definition.

In govdocs1, there are a handful of .wri files with suffix .doc.  We're getting 
an Invalid Header exception for these files.

I think it would be better to move their magic out of our .doc definition to 
the .wri definition and use the EmptyParser.

Any objections?
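To illustrate the proposal, here is a minimal sketch of content-based detection taking precedence over the file suffix. It is not Tika's detector: the assumption that the magic sits at offset 0, the `guessType` helper, and the media type strings are all illustrative.

```java
import java.util.Locale;

public class WriteMagicSniffer {
    // The magic bytes quoted in the issue (0x31be); offset 0 is an assumption.
    private static final byte[] WRITE_MAGIC = {0x31, (byte) 0xBE};

    public static boolean startsWithWriteMagic(byte[] header) {
        return header != null
                && header.length >= WRITE_MAGIC.length
                && header[0] == WRITE_MAGIC[0]
                && header[1] == WRITE_MAGIC[1];
    }

    // Content wins over suffix: a Write file named *.doc is still reported as Write.
    // The media type strings are illustrative assumptions.
    public static String guessType(String fileName, byte[] header) {
        if (startsWithWriteMagic(header)) {
            return "application/x-mswrite";
        }
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".wri")) {
            return "application/x-mswrite";
        }
        if (name.endsWith(".doc")) {
            return "application/msword";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] header = {0x31, (byte) 0xBE, 0x00, 0x00};
        System.out.println(guessType("746255.doc", header));
    }
}
```

Under this scheme the mislabeled govdocs1 files would no longer be routed to the Word parser, which is where the Invalid Header exception comes from.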





[jira] [Reopened] (TIKA-2232) Add JBIG2 image parsing support

2017-01-13 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reopened TIKA-2232:
---

Reopen to handle jbig2 not on class path






[jira] [Commented] (TIKA-2232) Add JBIG2 image parsing support

2017-01-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822212#comment-15822212
 ] 

Tim Allison commented on TIKA-2232:
---

We should be catching that and storing it in a metadata:warn key.  Are you 
getting that with trunk?
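A rough sketch of the "catch it and record a warning" idea follows. It uses a plain map as a stand-in for the metadata object, and the `ThrowingStep` interface and `runRecordingWarning` helper are hypothetical names; the key string is the one that appears in the metadata output quoted elsewhere on this issue.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WarnOnFailure {
    // Key as it appears in the metadata output quoted on this issue.
    public static final String WARN_KEY = "X-TIKA:EXCEPTION:warn";

    // Hypothetical functional interface for a step that may fail.
    public interface ThrowingStep {
        void run() throws Exception;
    }

    // Runs the step; on failure, records the exception as a warning instead of rethrowing.
    public static void runRecordingWarning(ThrowingStep step, Map<String, List<String>> metadata) {
        try {
            step.run();
        } catch (Exception e) {
            metadata.computeIfAbsent(WARN_KEY, k -> new ArrayList<>()).add(e.toString());
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        runRecordingWarning(() -> {
            throw new IllegalStateException("jbig2-imageio is not installed");
        }, metadata);
        System.out.println(metadata);
    }
}
```

The point of the pattern is that a missing optional image reader degrades to a metadata entry while the rest of the parse continues.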






[jira] [Commented] (TIKA-2232) Add JBIG2 image parsing support

2017-01-13 Thread Nicholas DiPiazza (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822136#comment-15822136
 ] 

Nicholas DiPiazza commented on TIKA-2232:
-

[~pascal.essiembre] totally

obviously with the GPL3 license most people cannot use this jbig2-imageio 
Library. So can we please provide a way to turn off this exception?

{code}
org.apache.pdfbox.filter.MissingImageReaderException: Cannot read JBIG2 image: 
jbig2-imageio is not installed
at org.apache.pdfbox.filter.Filter.findImageReader(Filter.java:128) 
~[pdfbox-2.0.1.jar:2.0.1]
at org.apache.pdfbox.filter.JBIG2Filter.decode(JBIG2Filter.java:55) 
~[pdfbox-2.0.1.jar:2.0.1]
at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69) 
~[pdfbox-2.0.1.jar:2.0.1]
at 
org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163) 
~[pdfbox-2.0.1.jar:2.0.1]
at 
org.apache.pdfbox.pdmodel.common.PDStream.createInputStream(PDStream.java:235) 
~[pdfbox-2.0.1.jar:2.0.1]
at 
org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.<init>(PDImageXObject.java:147)
 ~[pdfbox-2.0.1.jar:2.0.1]
at 
org.apache.pdfbox.pdmodel.graphics.PDXObject.createXObject(PDXObject.java:70) 
~[pdfbox-2.0.1.jar:2.0.1]
at 
org.apache.pdfbox.pdmodel.PDResources.getXObject(PDResources.java:385) 
~[pdfbox-2.0.1.jar:2.0.1]
at 
org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:359) 
~[tika-parsers-1.13.jar:1.13]
at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:271) 
~[tika-parsers-1.13.jar:1.13]
at 
org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393) 
~[pdfbox-2.0.1.jar:2.0.1]
at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:214) 
~[tika-parsers-1.13.jar:1.13]
at 
org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319) 
~[pdfbox-2.0.1.jar:2.0.1]
at 
org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266) 
~[pdfbox-2.0.1.jar:2.0.1]

{code}






[jira] [Comment Edited] (TIKA-2232) Add JBIG2 image parsing support

2017-01-13 Thread Nicholas DiPiazza (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822136#comment-15822136
 ] 

Nicholas DiPiazza edited comment on TIKA-2232 at 1/13/17 6:39 PM:
--

[~pascal.essiembre] totally

obviously with the GPL3 license most people cannot use this jbig2-imageio 
Library. So can we please provide a way to turn off this exception?

{code}
org.apache.pdfbox.filter.MissingImageReaderException: Cannot read JBIG2 image: 
jbig2-imageio is not installed
at org.apache.pdfbox.filter.Filter.findImageReader(Filter.java:128) 
~[pdfbox-2.0.1.jar:2.0.1]
at org.apache.pdfbox.filter.JBIG2Filter.decode(JBIG2Filter.java:55) 
~[pdfbox-2.0.1.jar:2.0.1]
at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69) 
~[pdfbox-2.0.1.jar:2.0.1]
at 
org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163) 
~[pdfbox-2.0.1.jar:2.0.1]
at 
org.apache.pdfbox.pdmodel.common.PDStream.createInputStream(PDStream.java:235) 
~[pdfbox-2.0.1.jar:2.0.1]
at 
org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.<init>(PDImageXObject.java:147)
 ~[pdfbox-2.0.1.jar:2.0.1]
at 
org.apache.pdfbox.pdmodel.graphics.PDXObject.createXObject(PDXObject.java:70) 
~[pdfbox-2.0.1.jar:2.0.1]
at 
org.apache.pdfbox.pdmodel.PDResources.getXObject(PDResources.java:385) 
~[pdfbox-2.0.1.jar:2.0.1] 
{code}


was (Author: nicholas.dipiazza):
[~pascal.essiembre] totally

obviously with the GPL3 license most people cannot use this jbig2-imageio 
Library. So can we please provide a way to turn off this exception?

{code}
org.apache.pdfbox.filter.MissingImageReaderException: Cannot read JBIG2 image: 
jbig2-imageio is not installed
at org.apache.pdfbox.filter.Filter.findImageReader(Filter.java:128) 
~[pdfbox-2.0.1.jar:2.0.1]
at org.apache.pdfbox.filter.JBIG2Filter.decode(JBIG2Filter.java:55) 
~[pdfbox-2.0.1.jar:2.0.1]
at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69) 
~[pdfbox-2.0.1.jar:2.0.1]
at 
org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163) 
~[pdfbox-2.0.1.jar:2.0.1]
at 
org.apache.pdfbox.pdmodel.common.PDStream.createInputStream(PDStream.java:235) 
~[pdfbox-2.0.1.jar:2.0.1]
at 
org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.<init>(PDImageXObject.java:147)
 ~[pdfbox-2.0.1.jar:2.0.1]
at 
org.apache.pdfbox.pdmodel.graphics.PDXObject.createXObject(PDXObject.java:70) 
~[pdfbox-2.0.1.jar:2.0.1]
at 
org.apache.pdfbox.pdmodel.PDResources.getXObject(PDResources.java:385) 
~[pdfbox-2.0.1.jar:2.0.1]
at 
org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:359) 
~[tika-parsers-1.13.jar:1.13]
at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:271) 
~[tika-parsers-1.13.jar:1.13]
at 
org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393) 
~[pdfbox-2.0.1.jar:2.0.1]
at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:214) 
~[tika-parsers-1.13.jar:1.13]
at 
org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319) 
~[pdfbox-2.0.1.jar:2.0.1]
at 
org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266) 
~[pdfbox-2.0.1.jar:2.0.1]

{code}






[jira] [Created] (TIKA-2239) Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser

2017-01-13 Thread Jorge Spinsanti (JIRA)
Jorge Spinsanti created TIKA-2239:
-

 Summary: Illegal IOException from 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser
 Key: TIKA-2239
 URL: https://issues.apache.org/jira/browse/TIKA-2239
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
Reporter: Jorge Spinsanti


I got an exception when extracting text from a DOCX file, caused by a SAXParseException in 
Apache POI. See the stack trace:

{code}
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@51a94303
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1114)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1050)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:199)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:462)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:281)
at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:232)
at 
org.eclipse.jetty.io.AbstractConnection$1.run(AbstractConnection.java:505)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:607)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:536)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal 
IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@51a94303
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 16 more
Caused by: java.io.IOException: Unable to parse xml bean
at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:118)
at 
org.openxmlformats.schemas.wordprocessingml.x2006.main.NumberingDocument$Factory.parse(Unknown
 Source)
at 
org.apache.poi.xwpf.usermodel.XWPFNumbering.onDocumentRead(XWPFNumbering.java:87)
at 
org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:204)
at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
at 
org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
at 
org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
at 
org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 22 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 37; The 
encoding declaration is required in the text declaration.
at 
org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown 
Source)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
at org.apache.xerces.impl.XMLScanner.scanXMLDeclOrTextDecl(Unknown 
Source)
at 
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanXMLDeclOrTextDecl(Unknown
 Source)
at 
org.apache.xerces.impl.XMLDocumentScannerImpl$XMLDeclDispatcher.dispatch(Unknown
 Source)
at 
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
at 
org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:137)
at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:115)
... 32 more
{code}





RE: FW: tika-2.x-windows - Build # 94 - Still Failing

2017-01-13 Thread Allison, Timothy B.
Until we get 2.x-windows working (again? did it ever work?), polling should be 
either never or hourly on changes in git, I guess, like the others?

The build seems to have stopped for now, which is good.

I’m not sure why it was running hourly even without a git change…

Any help you could offer getting that build working would be great!

From: lewis john mcgibbney [mailto:lewi...@apache.org]
Sent: Friday, January 13, 2017 12:14 PM
To: Allison, Timothy B. 
Cc: dev@tika.apache.org
Subject: Re: FW: tika-2.x-windows - Build # 94 - Still Failing

Hi Tim,
What do you want to change the polling to? We can make it nightly or something.
What do you want?
Thanks

On Thu, Jan 5, 2017 at 4:35 AM, Allison, Timothy B. wrote:
Lewis,
  Looks like our 2.x windows build is still failing.  The new behavior, though, 
is that Jenkins is trying every couple of hours.  Any chance you'd have time to 
look into this?  Thank you!

-Original Message-
From: Apache Jenkins Server 
[mailto:jenk...@builds.apache.org]
Sent: Wednesday, January 4, 2017 4:18 PM
To: dev@tika.apache.org
Subject: tika-2.x-windows - Build # 94 - Still Failing

The Apache Jenkins build system has built tika-2.x-windows (build #94)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/94/ to 
view the results.



--
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


Re: FW: tika-2.x-windows - Build # 94 - Still Failing

2017-01-13 Thread lewis john mcgibbney
Hi Tim,
What do you want to change the polling to? We can make it nightly or
something.
What do you want?
Thanks

On Thu, Jan 5, 2017 at 4:35 AM, Allison, Timothy B. 
wrote:

> Lewis,
>   Looks like our 2.x windows build is still failing.  The new behavior,
> though, is that Jenkins is trying every couple of hours.  Any chance you'd
> have time to look into this?  Thank you!
>
> -Original Message-
> From: Apache Jenkins Server [mailto:jenk...@builds.apache.org]
> Sent: Wednesday, January 4, 2017 4:18 PM
> To: dev@tika.apache.org
> Subject: tika-2.x-windows - Build # 94 - Still Failing
>
> The Apache Jenkins build system has built tika-2.x-windows (build #94)
>
> Status: Still Failing
>
> Check console output at https://builds.apache.org/job/tika-2.x-windows/94/
> to view the results.
>



-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


[jira] [Resolved] (TIKA-2238) Add mime detection for embedded MSEquation files

2017-01-13 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2238.
---
   Resolution: Fixed
Fix Version/s: 1.15
   2.0

> Add mime detection for embedded MSEquation files
> 
>
> Key: TIKA-2238
> URL: https://issues.apache.org/jira/browse/TIKA-2238
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 2.0, 1.15
>
>






[jira] [Created] (TIKA-2238) Add mime detection for embedded MSEquation files

2017-01-13 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2238:
-

 Summary: Add mime detection for embedded MSEquation files
 Key: TIKA-2238
 URL: https://issues.apache.org/jira/browse/TIKA-2238
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Trivial








[jira] [Commented] (TIKA-2181) Upgrade to POI 3.16-beta2 when available

2017-01-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821796#comment-15821796
 ] 

Tim Allison commented on TIKA-2181:
---

Remove NPE check around {{getShapes}} in XSSFExcelExtractorDecorator once we 
upgrade POI. 

> Upgrade to POI 3.16-beta2 when available
> 
>
> Key: TIKA-2181
> URL: https://issues.apache.org/jira/browse/TIKA-2181
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
>






[jira] [Comment Edited] (TIKA-2181) Upgrade to POI 3.16-beta2 when available

2017-01-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821796#comment-15821796
 ] 

Tim Allison edited comment on TIKA-2181 at 1/13/17 1:52 PM:


Remove NPE check around {{getShapes}} in XSSFExcelExtractorDecorator once we 
upgrade POI. TIKA-2134 


was (Author: talli...@mitre.org):
Remove NPE check around {{getShapes}} in XSSFExcelExtractorDecorator once we 
upgrade POI. 







[jira] [Resolved] (TIKA-2134) Different NullPointerException on a valid Excel file

2017-01-13 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2134.
---
   Resolution: Fixed
Fix Version/s: 1.15
   2.0

Workaround added in Tika for now.
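For readers curious what such a workaround can look like, here is a hedged sketch of a guard around a call that may throw a NullPointerException from library code. It is not the actual change in XSSFExcelExtractorDecorator; `shapesOrEmpty` and the supplier-based shape are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

public class NpeGuard {
    // Calls the supplier and treats both a null result and an NPE thrown
    // inside the library call as "no shapes", so parsing can continue.
    public static <T> List<T> shapesOrEmpty(Supplier<List<T>> getShapes) {
        try {
            List<T> shapes = getShapes.get();
            return shapes == null ? Collections.<T>emptyList() : shapes;
        } catch (NullPointerException e) {
            // Temporary guard; intended to be removed once the library is fixed.
            return Collections.<T>emptyList();
        }
    }

    public static void main(String[] args) {
        List<String> ok = shapesOrEmpty(() -> Arrays.asList("shape1", "shape2"));
        List<String> broken = NpeGuard.<String>shapesOrEmpty(() -> {
            throw new NullPointerException("drawing part missing");
        });
        System.out.println(ok.size() + " " + broken.size());
    }
}
```

Catching NullPointerException is normally discouraged, so a guard like this only makes sense as a stopgap that is removed after the upstream fix lands.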

> Different NullPointerException on a valid Excel file
> 
>
> Key: TIKA-2134
> URL: https://issues.apache.org/jira/browse/TIKA-2134
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Fix For: 2.0, 1.15
>
> Attachments: $R1ZQO9F.xlsx
>
>
> On the attached Excel file, which opens fine in Excel, the Tika parser throws 
> the following error:
> java.lang.NullPointerException
>   at org.apache.poi.xssf.usermodel.XSSFDrawing.<init>(XSSFDrawing.java:89)
>   at org.apache.poi.xssf.usermodel.XSSFDrawing.<init>(XSSFDrawing.java:97)
>   at 
> org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.getShapes(XSSFReader.java:308)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:152)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:97)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
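
The workaround itself is not shown in this message. As a rough sketch of the general idea only (treating a NullPointerException thrown by a library call as "nothing to extract" so the rest of the parse can continue), something like the following could be used. The class and method names here are hypothetical and are not Tika or POI code:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

public class ShapeGuardDemo {

    // Hypothetical helper (not part of Tika or POI): runs a lookup that may
    // throw a NullPointerException inside a third-party library and falls
    // back to an empty list so that text extraction can carry on.
    static <T> List<T> shapesOrEmpty(Supplier<List<T>> lookup) {
        try {
            List<T> shapes = lookup.get();
            return shapes == null ? Collections.<T>emptyList() : shapes;
        } catch (NullPointerException e) {
            // Assumption: the NPE comes from the library, as in the trace above.
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        // A lookup that fails the way the reported one does.
        List<String> broken = shapesOrEmpty(() -> {
            throw new NullPointerException("simulated library failure");
        });
        // A lookup that succeeds.
        List<String> fine = shapesOrEmpty(() -> Arrays.asList("shape-1", "shape-2"));

        System.out.println(broken.size());
        System.out.println(fine.size());
    }
}
```

Catching NullPointerException is normally a smell; it is only defensible here as a temporary guard until the underlying library is fixed, which matches the note above about removing the check after the POI upgrade.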





[jira] [Resolved] (TIKA-2216) ArrayIndexOutOfBoundsException on a valid Word file

2017-01-13 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2216.
---
Resolution: Duplicate

> ArrayIndexOutOfBoundsException on a valid Word file
> ---
>
> Key: TIKA-2216
> URL: https://issues.apache.org/jira/browse/TIKA-2216
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: TB Coord RFCb.doc
>
>
> java.lang.ArrayIndexOutOfBoundsException: 
>   at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
>   at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
>   at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
>   at org.apache.poi.hwpf.HWPFOldDocument.<init>:132
>   at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
>   at org.apache.tika.parser.microsoft.WordExtractor.parse:153
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:169
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Resolved] (TIKA-2205) IllegalArgumentException on a valid Excel file

2017-01-13 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2205.
---
Resolution: Duplicate

> IllegalArgumentException on a valid Excel file
> --
>
> Key: TIKA-2205
> URL: https://issues.apache.org/jira/browse/TIKA-2205
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: SAT19-11-25-09_Selected Dates.xls
>
>
> The attached file, which opens in Excel, errors out in Tika:
> java.lang.IllegalArgumentException: Cannot format given Object as a Number
>   at java.text.DecimalFormat.format:-1
>   at java.text.Format.format:-1
>   at org.apache.poi.ss.usermodel.DataFormatter.performDateFormatting:736
>   at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:804
>   at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:785
>   at 
> org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell:143
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.formatNumberDateCell:633
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord:432
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord:336
>   at 
> org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord:92
>   at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord:109
>   at 
> org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:179
>   at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
>   at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:177
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130
>   at gov.nih.niaid.fscanner.Extract.ExtractContents:69





[jira] [Comment Edited] (TIKA-2152) NullPointerException on a valid Word file

2017-01-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15821755#comment-15821755
 ] 

Tim Allison edited comment on TIKA-2152 at 1/13/17 1:16 PM:


This file is parsed without problem by the new experimental SAX docx parser.


was (Author: talli...@mitre.org):
These files are parsed without problem by the new experimental SAX docx parser.

> NullPointerException on a valid Word file
> -
>
> Key: TIKA-2152
> URL: https://issues.apache.org/jira/browse/TIKA-2152
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: A5346.docx
>
>
> On the attached Word document, which opens fine in Word, the Tika parser 
> throws the following error:
> java.lang.NullPointerException
>   at 
> org.apache.poi.xwpf.usermodel.XWPFStyles.getStyle(XWPFStyles.java:198)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:149)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:107)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractTable(XWPFWordExtractorDecorator.java:362)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractHeaderText(XWPFWordExtractorDecorator.java:414)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractHeaders(XWPFWordExtractorDecorator.java:404)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:89)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Commented] (TIKA-2152) NullPointerException on a valid Word file

2017-01-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15821755#comment-15821755
 ] 

Tim Allison commented on TIKA-2152:
---

These files are parsed without problem by the new experimental SAX docx parser.

> NullPointerException on a valid Word file
> -
>
> Key: TIKA-2152
> URL: https://issues.apache.org/jira/browse/TIKA-2152
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: A5346.docx
>
>
> On the attached Word document, which opens fine in Word, the Tika parser 
> throws the following error:
> java.lang.NullPointerException
>   at 
> org.apache.poi.xwpf.usermodel.XWPFStyles.getStyle(XWPFStyles.java:198)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:149)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:107)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractTable(XWPFWordExtractorDecorator.java:362)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractHeaderText(XWPFWordExtractorDecorator.java:414)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractHeaders(XWPFWordExtractorDecorator.java:404)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:89)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Commented] (TIKA-2163) POIXMLException from ClassCastException on a valid Word template

2017-01-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15821756#comment-15821756
 ] 

Tim Allison commented on TIKA-2163:
---

This is parsed without problem by the new experimental SAX docx parser.

> POIXMLException from ClassCastException on a valid Word template
> 
>
> Key: TIKA-2163
> URL: https://issues.apache.org/jira/browse/TIKA-2163
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: ChronologicalResume.dotx
>
>
> On the attached Word template, which opens fine with Word, the Tika parser 
> throws the following error:
> org.apache.poi.POIXMLException: java.lang.reflect.InvocationTargetException
>   at 
> org.apache.poi.POIXMLFactory.createDocumentPart(POIXMLFactory.java:65)
>   at org.apache.poi.POIXMLDocumentPart.read(POIXMLDocumentPart.java:601)
>   at org.apache.poi.POIXMLDocumentPart.read(POIXMLDocumentPart.java:613)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:156)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> Caused by: java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFactory.createDocumentPart(XWPFFactory.java:57)
>   at 
> org.apache.poi.POIXMLFactory.createDocumentPart(POIXMLFactory.java:60)
>   ... 10 more
> Caused by: java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart 
> cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument
>   at 
> org.apache.poi.xwpf.usermodel.XWPFHeaderFooter.<init>(XWPFHeaderFooter.java:74)
>   at org.apache.poi.xwpf.usermodel.XWPFHeader.<init>(XWPFHeader.java:54)
>   ... 16 more





[jira] [Commented] (TIKA-2147) ClassCastException on a valid Word template

2017-01-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15821754#comment-15821754
 ] 

Tim Allison commented on TIKA-2147:
---

These files are parsed without problem by the new experimental SAX docx parser.

> ClassCastException on a valid Word template
> ---
>
> Key: TIKA-2147
> URL: https://issues.apache.org/jira/browse/TIKA-2147
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: basicresume.docx, Forefront Fax.dotx
>
>
> On the attached document template, which opens fine in Word, the Tika parser 
> throws the following error:
> java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart cannot be 
> cast to org.apache.poi.xwpf.usermodel.XWPFDocument
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFootnotes.getXWPFDocument(XWPFFootnotes.java:162)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFootnote.<init>(XWPFFootnote.java:47)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFootnotes.onDocumentRead(XWPFFootnotes.java:95)
>   at 
> org.apache.poi.POIXMLDocumentPart._invokeOnDocumentRead(POIXMLDocumentPart.java:658)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:235)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Resolved] (TIKA-2207) ArrayIndexOutOfBoundsException on a valid Excel file

2017-01-13 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2207.
---
   Resolution: Fixed
Fix Version/s: 1.15
   2.0

Works now.

> ArrayIndexOutOfBoundsException on a valid Excel file
> 
>
> Key: TIKA-2207
> URL: https://issues.apache.org/jira/browse/TIKA-2207
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Fix For: 2.0, 1.15
>
> Attachments: Merck 9333 MPS 9-22-16.xlsx
>
>
> The attached file, which opens in Excel, errors out in Tika:
> java.lang.ArrayIndexOutOfBoundsException: 32
>   at 
> org.apache.commons.compress.compressors.lzw.LZWInputStream.initializeTables:126
>   at 
> org.apache.commons.compress.compressors.z.ZCompressorInputStream.<init>:54
>   at 
> org.apache.commons.compress.compressors.CompressorStreamFactory.createCompressorInputStream:237
>   at 
> org.apache.tika.parser.pkg.ZipContainerDetector.detectCompressorFormat:109
>   at org.apache.tika.parser.pkg.ZipContainerDetector.detect:95
>   at org.apache.tika.detect.CompositeDetector.detect:77
>   at org.apache.tika.parser.AutoDetectParser.parse:112
>   at org.apache.tika.parser.DelegatingParser.parse:72
>   at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded:102
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE:245
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts:197
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML:115
>   at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML:105
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:112
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87





[jira] [Resolved] (TIKA-2166) TaggedIOException from a ZipException on a valid Word file

2017-01-13 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2166.
---
   Resolution: Fixed
Fix Version/s: 1.15
   2.0

Works now

> TaggedIOException from a ZipException on a valid Word file
> --
>
> Key: TIKA-2166
> URL: https://issues.apache.org/jira/browse/TIKA-2166
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Fix For: 2.0, 1.15
>
> Attachments: AMSMIC briefing doc.docx
>
>
> On the attached file, which opens with Word, Tika throws:
> org.apache.tika.io.TaggedIOException: invalid block type
>   at 
> org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:63)
>   at org.gagravarr.tika.OggDetector.detect(OggDetector.java:68)
>   at 
> org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
>   at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>   at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
>   at gov.nih.niaid.fscanner.Extract.ExtractContents(Extract.java:63)
>   at gov.nih.niaid.temp.Main.main(Main.java:68)
> Caused by: org.apache.tika.io.TaggedIOException: invalid block type
>   at 
> org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:59)
>   ... 12 more
> Caused by: java.util.zip.ZipException: invalid block type
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at 
> org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   ... 16 more





[jira] [Resolved] (TIKA-2162) "Unknown compression method" on a Powerpoint file

2017-01-13 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2162.
---
   Resolution: Fixed
Fix Version/s: 1.15
   2.0

works now.

> "Unknown compression method" on a Powerpoint file
> -
>
> Key: TIKA-2162
> URL: https://issues.apache.org/jira/browse/TIKA-2162
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Fix For: 2.0, 1.15
>
> Attachments: DECAY.ppt
>
>
> On the attached Powerpoint file, which opens fine with Powerpoint, the Tika 
> parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> unknown compression method
>   at org.apache.poi.hslf.blip.EMF.getData(EMF.java:91)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: unknown compression method
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.EMF.getData(EMF.java:85)
>   ... 6 more





[jira] [Resolved] (TIKA-2153) TaggedIOException on a valid Powerpoint file

2017-01-13 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2153.
---
   Resolution: Fixed
Fix Version/s: 1.15
   2.0

Works now.

> TaggedIOException on a valid Powerpoint file
> 
>
> Key: TIKA-2153
> URL: https://issues.apache.org/jira/browse/TIKA-2153
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Fix For: 2.0, 1.15
>
> Attachments: daids.ppt, IAVI Team meeting FINAL.ppt, 
> Jinwoo_032910.pptx, Marcia Lecture.PPT, tika_2153_unzipping.png
>
>
> On the following Powerpoint file, which opens fine with Powerpoint:
> https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx
> the Tika parser throws the following error:
> org.apache.tika.io.TaggedIOException: invalid stored block lengths
>   at 
> org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
>   at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
>   at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
>   at 
> org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
>   at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>   at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
>   at 
> org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78)
>   ... 13 more
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at 
> org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   ... 19 more
> Could be similar to #2130.
> EDIT: similar exception on the attached Jinwoo_032910.pptx
> EDIT: similar exception on daids.ppt
> EDIT: similar exception on "IAVI Team meeting FINAL.ppt", but see #2162
> EDIT: "Marcia Lecture.PPT"





[jira] [Resolved] (TIKA-2136) External file links in PPTX misparsed

2017-01-13 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2136.
---
   Resolution: Fixed
Fix Version/s: 1.15
   2.0

> External file links in PPTX misparsed
> -
>
> Key: TIKA-2136
> URL: https://issues.apache.org/jira/browse/TIKA-2136
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Fix For: 2.0, 1.15
>
> Attachments: 81809 lab presentation.pptx
>
>
> The attached document contains links to external files. Trying to parse it 
> with the Tika parser throws the following error:
> java.lang.NullPointerException
>   at 
> org.apache.poi.openxml4j.opc.PackagePartName.throwExceptionIfEmptyURI(PackagePartName.java:204)
>   at 
> org.apache.poi.openxml4j.opc.PackagePartName.throwExceptionIfInvalidPartUri(PackagePartName.java:174)
>   at 
> org.apache.poi.openxml4j.opc.PackagePartName.<init>(PackagePartName.java:85)
>   at 
> org.apache.poi.openxml4j.opc.PackagingURIHelper.createPartName(PackagingURIHelper.java:493)
>   at 
> org.apache.poi.openxml4j.opc.PackagePart.getRelatedPart(PackagePart.java:485)
>   at 
> org.apache.poi.xslf.usermodel.XSLFSlideShow.<init>(XSLFSlideShow.java:86)
>   at 
> org.apache.poi.xslf.extractor.XSLFPowerPointExtractor.<init>(XSLFPowerPointExtractor.java:62)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:244)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> The error happens in the URI validator, but not because the URI fails 
> validation; the function fails because partURI.getPath() returns a null and 
> there's no null check. The link in the file may not be valid, but it's not 
> malformed. And it definitely shouldn't prevent text extraction from the file.
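
To illustrate the failure mode described above: `java.net.URI.getPath()` returns null for opaque URIs (for example `mailto:` links), so calling a method on the result without a null check throws a NullPointerException. The sketch below shows a null-safe emptiness check. The helper name `hasEmptyPath` and the sample URIs are assumptions made for illustration; this is not POI's actual validator, and the URI in the attached file may be of a different kind.

```java
import java.net.URI;

public class PartUriCheck {

    // Hypothetical helper (not POI code): reports whether a URI has no usable
    // path, treating a null path the same as an empty one instead of
    // dereferencing it.
    static boolean hasEmptyPath(URI partUri) {
        String path = partUri.getPath(); // null when the URI is opaque
        return path == null || path.isEmpty();
    }

    public static void main(String[] args) {
        // Opaque URI: scheme-specific part does not start with '/'.
        URI opaque = URI.create("mailto:someone@example.com");
        // Hierarchical URI resembling an OPC part name (illustrative only).
        URI hierarchical = URI.create("/ppt/slides/slide1.xml");

        System.out.println(opaque.getPath());
        System.out.println(hasEmptyPath(opaque));
        System.out.println(hasEmptyPath(hierarchical));
    }
}
```

With a check of this shape, an odd external link can be rejected or skipped as invalid without aborting text extraction for the whole document.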





[jira] [Resolved] (TIKA-2161) EOFException on a valid Powerpoint file

2017-01-13 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2161.
---
   Resolution: Fixed
Fix Version/s: 1.15
   2.0

> EOFException on a valid Powerpoint file
> ---
>
> Key: TIKA-2161
> URL: https://issues.apache.org/jira/browse/TIKA-2161
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Fix For: 2.0, 1.15
>
> Attachments: Erik-LymeChipBranchSeminar.ppt
>
>
> On the attached Powerpoint file, which opens fine with Powerpoint, the Tika 
> parser throws the following error:
> org.apache.tika.io.TaggedIOException: Unexpected end of ZLIB input stream
>   at 
> org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at java.nio.file.Files.copy(Files.java:2908)
>   at java.nio.file.Files.copy(Files.java:3027)
>   at org.apache.tika.io.TikaInputStream.getPath(TikaInputStream.java:587)
>   at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:615)
>   at 
> org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:377)
>   at 
> org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:443)
>   at 
> org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
>   at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>   at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
>   at 
> org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:140)
>   at 
> org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:116)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources(HSLFExtractor.java:368)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:138)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.io.EOFException: Unexpected end of ZLIB input stream
>   at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
>   at 
> org.apache.poi.util.BoundedInputStream.read(BoundedInputStream.java:121)
>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   ... 22 more
> EDIT: Tika 1.14 throws EOFException





[jira] [Resolved] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2017-01-13 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2164.
---
   Resolution: Fixed
Fix Version/s: 1.15
   2.0

Tested with all attached.  No exceptions thrown.

> HSLFException from ZipException "invalid stored block lengths" on a valid 
> Powerpoint file
> -
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Fix For: 2.0, 1.15
>
> Attachments: Jankovic final Retreat 2002.PPT, Lab Meeting.ppt, 
> paperfigures.ppt, Research Forum 2013.3.ppt, suba.ppt
>
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> invalid stored block lengths
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>   ... 6 more
> EDIT: the attached "Research forum" file emits a similar error "invalid block 
> type".
> EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar 
> "invalid literal/length code" error.
> EDIT3: the attached "paperfigures" file emits "invalid distance too far 
> back". Something is wrong with ZIP in Powerpoints.
> EDIT4: in "Lab meeting", it's "Unexpected end of ZLIB input stream"
> "suba" exhibits a similar error, "invalid distance too far back" but in a 
> different exception.
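
The errors listed above all surface from `java.util.zip.InflaterInputStream` when the compressed picture data is incomplete or damaged. As a small, self-contained sketch of one of them, the following compresses some sample bytes, cuts the result in half, and tries to inflate it; the read fails with an `EOFException` (an `IOException`), which is the kind of failure "Unexpected end of ZLIB input stream" reports. The class, method names, and sample data are made up for illustration and say nothing about why the attached files trigger these errors.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class TruncatedZlibDemo {

    // Builds some repetitive but non-trivial sample text.
    static byte[] sampleData() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("record ").append(i).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compresses the input into a zlib stream held in memory.
    static byte[] deflate(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out)) {
            dos.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Keeps only the first half of the bytes, simulating a cut-off stream.
    static byte[] firstHalf(byte[] data) {
        return Arrays.copyOf(data, data.length / 2);
    }

    // Inflates everything; returns "ok" or the simple name of the exception.
    static String tryInflate(byte[] compressed) {
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            while (in.read(buf) != -1) {
                // discard the decompressed bytes; only success or failure matters
            }
            return "ok";
        } catch (IOException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        byte[] compressed = deflate(sampleData());
        System.out.println(tryInflate(compressed));
        System.out.println(tryInflate(firstHalf(compressed)));
    }
}
```

A caller that wants to keep extracting slide text can catch the `IOException` around the embedded-picture read and skip just that picture.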





[jira] [Resolved] (TIKA-2215) TikaException about "Invalid embedded resource" on a valid PPT file

2017-01-13 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2215.
---
   Resolution: Fixed
Fix Version/s: 1.15
   2.0

resolved with TIKA-2159

> TikaException about "Invalid embedded resource" on a valid PPT file
> ---
>
> Key: TIKA-2215
> URL: https://issues.apache.org/jira/browse/TIKA-2215
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Fix For: 2.0, 1.15
>
> Attachments: Iverson.ppt
>
>
> On the attached file, which opens with PowerPoint, the Tika parser throws the 
> following error:
> org.apache.tika.exception.TikaException: Invalid embedded resource
>   at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc:243
>   at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources:390
>   at org.apache.tika.parser.microsoft.HSLFExtractor.parse:142
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:172
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130
> Caused by: java.lang.IndexOutOfBoundsException: Block 32630271 not found
>   at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt:486
>   at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:169
>   at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:142
>   at org.apache.poi.poifs.filesystem.NDocumentInputStream.readFully:248
>   at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:165
>   at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:160
>   at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc:226
>   at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources:390
>   at org.apache.tika.parser.microsoft.HSLFExtractor.parse:142
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:172
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130
> Caused by: java.lang.IndexOutOfBoundsException: Unable to read 512 bytes from 16706699264 in stream of length 164352
>   at org.apache.poi.poifs.nio.ByteArrayBackedDataSource.read:42
>   at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt:484
>   at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:169
>   at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:142
>   at org.apache.poi.poifs.filesystem.NDocumentInputStream.readFully:248
>   at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:165
>   at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:160
>   at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc:226
>   at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources:390
>   at org.apache.tika.parser.microsoft.HSLFExtractor.parse:142
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:172
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130
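
The two "Caused by" messages in the trace above are consistent with each other if one assumes the usual OLE2 layout of 512-byte blocks preceded by a single 512-byte header: block 32630271 would then start at byte (32630271 + 1) * 512 = 16706699264, far beyond a 164352-byte stream. The sketch below only works through that arithmetic. The class and method names are hypothetical and this is not POI's code.

```java
final class PoifsBlockOffset {

    /** Byte offset of a block, assuming one header block precedes block 0. */
    static long blockOffset(long blockIndex, int blockSize) {
        return (blockIndex + 1) * blockSize;
    }

    /** True if the whole block lies inside a stream of the given length. */
    static boolean blockFits(long blockIndex, int blockSize, long streamLength) {
        return blockOffset(blockIndex, blockSize) + blockSize <= streamLength;
    }

    public static void main(String[] args) {
        int blockSize = 512;          // from "Unable to read 512 bytes"
        long streamLength = 164352L;  // from "stream of length 164352"
        long badBlock = 32630271L;    // from "Block 32630271 not found"

        System.out.println("offset of bad block: " + blockOffset(badBlock, blockSize));
        System.out.println("bad block fits:      " + blockFits(badBlock, blockSize, streamLength));
        // 164352 / 512 = 321 blocks including the header, so the last data block is index 319.
        System.out.println("block 319 fits:      " + blockFits(319, blockSize, streamLength));
        System.out.println("block 320 fits:      " + blockFits(320, blockSize, streamLength));
    }
}
```

Under these assumptions the requested block index is itself implausible for a file of this size, which would point at a bad block reference inside the embedded object rather than at a short read.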





[jira] [Resolved] (TIKA-2159) Handle pre-parse embedded object exceptions uniformly and more robustly

2017-01-13 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2159.
---
Resolution: Fixed
Fix Version/s: 1.15, 2.0

I added the TikaCoreProperties.TIKA_META_EXCEPTION_EMBEDDED_STREAM property to 
store stack traces in the parent file's metadata when an exception occurs while 
trying to read the stream of an embedded file.

There may be some areas for further work; I focused on MSOffice, PDF and RTF.
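
A rough sketch of the pattern described above, for illustration only: it is not the committed Tika code. A plain Map stands in for the parent document's Metadata, the key string is a placeholder (the property's actual name string is not given in this message), and StreamOpener, stackTraceOf and openOrRecord are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class EmbeddedStreamExceptionSketch {

    // Placeholder key standing in for TIKA_META_EXCEPTION_EMBEDDED_STREAM.
    static final String EMBEDDED_STREAM_EXCEPTION = "example:embedded-stream-exception";

    /** Hypothetical hook for "get the stream of an embedded file". */
    interface StreamOpener {
        InputStream open() throws IOException;
    }

    /** Renders a throwable's stack trace as a string. */
    static String stackTraceOf(Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw, true));
        return sw.toString();
    }

    /**
     * Tries to open the embedded stream. On failure the stack trace is stored
     * in the parent's metadata and null is returned, so the caller can skip
     * this embedded file and continue with the rest of the parent.
     */
    static InputStream openOrRecord(StreamOpener opener, Map<String, List<String>> parentMetadata) {
        try {
            return opener.open();
        } catch (IOException | RuntimeException e) {
            parentMetadata
                    .computeIfAbsent(EMBEDDED_STREAM_EXCEPTION, k -> new ArrayList<>())
                    .add(stackTraceOf(e));
            return null;
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> parentMetadata = new HashMap<>();
        InputStream in = openOrRecord(
                () -> { throw new IndexOutOfBoundsException("Block 733 not found"); },
                parentMetadata);
        System.out.println("stream opened: " + (in != null));
        System.out.println(parentMetadata.get(EMBEDDED_STREAM_EXCEPTION).get(0));
    }
}
```

Catching RuntimeException alongside IOException is a choice made for this sketch, prompted by the IndexOutOfBoundsException in the related reports. The value list allows one entry per failing embedded file.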

> Handle pre-parse embedded object exceptions uniformly and more robustly
> ---
>
> Key: TIKA-2159
> URL: https://issues.apache.org/jira/browse/TIKA-2159
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 2.0, 1.15
>
>
> When an embedded document is parsed and causes an exception, we're currently 
> catching that and swallowing it in ParsingEmbeddedDocumentExtractor (the 
> default) or reporting it in the RecursiveParserWrapper by storing the 
> stacktrace in the Metadata of the embedded document.
> However, if there's an exception during detection on the embedded stream or 
> on getting the stream _before_ the stream hits the parser, we aren't handling 
> that uniformly or robustly across parsers.





[jira] [Resolved] (TIKA-2204) IndexOutOfBoundsException on a valid Powerpoint file

2017-01-13 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2204.
---
Resolution: Fixed

fixed with TIKA-2159

> IndexOutOfBoundsException on a valid Powerpoint file
> 
>
> Key: TIKA-2204
> URL: https://issues.apache.org/jira/browse/TIKA-2204
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: 061511.pptx
>
>
> The attached file, which opens in Powerpoint, errors in Tika:
> java.lang.IndexOutOfBoundsException: Block 733 not found
>   at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt:486
>   at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readCoreContents:449
>   at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>:335
>   at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>:87
>   at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE:226
>   at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts:197
>   at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML:115
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:112
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87





[jira] [Updated] (TIKA-2204) IndexOutOfBoundsException on a valid Powerpoint file

2017-01-13 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2204:
--
Fix Version/s: 1.15, 2.0

> IndexOutOfBoundsException on a valid Powerpoint file
> 
>
> Key: TIKA-2204
> URL: https://issues.apache.org/jira/browse/TIKA-2204
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Fix For: 2.0, 1.15
>
> Attachments: 061511.pptx
>
>
> The attached file, which opens in Powerpoint, errors in Tika:
> java.lang.IndexOutOfBoundsException: Block 733 not found
>   at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt:486
>   at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readCoreContents:449
>   at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>:335
>   at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>:87
>   at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE:226
>   at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts:197
>   at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML:115
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:112
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87


