Re: [VOTE] Apache Tika 1.14 Release Candidate #1

2016-11-01 Thread Ken Krugler
[Resending - has anyone else run into this same issue, when building from the 
1.14-rc1 tag?]

Just for grins, I pulled from git and checked out the 1.14-rc1 tag, then 
ran “mvn clean package”.

For me it fails with:

Running org.apache.tika.parser.strings.StringsParserTest
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.685 sec <<< FAILURE! - in org.apache.tika.parser.strings.StringsParserTest
testParse(org.apache.tika.parser.strings.StringsParserTest)  Time elapsed: 1.685 sec  <<< FAILURE!
java.lang.AssertionError: null
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at org.apache.tika.parser.strings.StringsParserTest.testParse(StringsParserTest.java:68)

…

Results :

Failed tests: 
 StringsParserTest.testParse:68 null

Tests run: 755, Failures: 1, Errors: 0, Skipped: 18

— Ken

> On Oct 19, 2016, at 11:48am, Chris Mattmann  wrote:
> 
> Hi Folks,
> 
> A first candidate for the Tika 1.14 release is available at:
> 
> https://dist.apache.org/repos/dist/dev/tika/
> 
> The release candidate is a zip archive of the sources in:
> 
> https://git-wip-us.apache.org/repos/asf?p=tika.git;a=tree;hb=687d7706c9778e4f49f2834a07e5a9d99b23042b
>  
> 
> The SHA1 checksum of the archive is:
> ad9152392ffe6b620c8102ab538df0579b36c520
> 
> In addition, a staged maven repository is available here:
> 
> https://repository.apache.org/content/repositories/orgapachetika-1020/
> 
> Please vote on releasing this package as Apache Tika 1.14.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Tika 1.14
> [ ] -1 Do not release this package because..
> 
> Cheers,
> Chris
> 
> P.S. Of course here is my +1.

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
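
For anyone checking a downloaded copy of the source archive against the SHA1 value quoted above, here is a minimal Java sketch. It uses only the JDK's MessageDigest; the class name Sha1Check and its two helpers are illustrative and are not part of Tika or of the release tooling.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for this thread only, not a Tika class.
class Sha1Check {

    /** Streams the input through SHA-1 and returns the digest as lowercase hex. */
    static String sha1Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Returns true when the file's SHA-1 equals the expected hex string (case-insensitive). */
    static boolean matches(Path archive, String expectedHex)
            throws IOException, NoSuchAlgorithmException {
        try (InputStream in = Files.newInputStream(archive)) {
            return sha1Hex(in).equalsIgnoreCase(expectedHex.trim());
        }
    }
}
```

Sha1Check.matches(pathToArchive, "ad9152392ffe6b620c8102ab538df0579b36c520") should return true for an intact copy of the candidate. The archive's file name is not given in this thread, so the path is left to the reader.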





[jira] [Commented] (TIKA-2098) Tika.parseToString() with maxLength doesn't work correctly for PDF files

2016-11-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15626453#comment-15626453
 ] 

Hudson commented on TIKA-2098:
--

FAILURE: Integrated in Jenkins build tika-2.x #169 (See [https://builds.apache.org/job/tika-2.x/169/])
improve unit test for TIKA-2098 (tallison: rev 6ca74bec6a1d448bbe3340d51dc84ca8ca58507a)
* (edit) tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java


> Tika.parseToString() with maxLength doesn't work correctly for PDF files
> 
>
> Key: TIKA-2098
> URL: https://issues.apache.org/jira/browse/TIKA-2098
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
>Reporter: Alexander Kazakov
>Assignee: Tim Allison
>  Labels: java, parser, pdf
> Fix For: 2.0, 1.14
>
>
> When parsing a PDF file with Tika.parseToString(InputStream stream, Metadata 
> metadata, int maxLength) and maxLength < content size, it throws an exception.
> {code:java}
> org.apache.tika.exception.TikaException: Unable to extract all PDF content
>   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:135)
>   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:150)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.Tika.parseToString(Tika.java:568)
> Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to write a string: Tika - Content Analysis Toolkit
>   at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:302)
>   at org.apache.pdfbox.text.PDFTextStripper.writeString(PDFTextStripper.java:779)
>   at org.apache.pdfbox.text.PDFTextStripper.writeLine(PDFTextStripper.java:1738)
>   at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:672)
>   at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
>   at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:143)
>   at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>   at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111)
>   ... 35 more
> Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
>   at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
>   at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
>   at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>   at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
>   at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
>   at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
>   at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
>   at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:279)
>   at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:306)
>   at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:300)
>   ... 43 more
> Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the li
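
In the trace above, the write-limit signal (WriteLimitReachedException) is buried under several wrappers (TaggedSAXException, IOExceptionWithCause, TikaException), so a caller that only looks at the outermost exception cannot tell "limit reached" apart from a real parse failure. As a rough illustration of how such a signal could still be recognised, the sketch below walks the cause chain. LimitReached and LimitCauseCheck are hypothetical stand-ins written against the JDK only; this is an assumption about one possible approach, not Tika's actual fix.

```java
// Hypothetical illustration only; none of these names exist in Tika.
class LimitCauseCheck {

    /** Stand-in for a "write limit reached" marker exception. */
    static class LimitReached extends Exception {
        LimitReached(String message) {
            super(message);
        }
    }

    /** Returns true if the throwable or any of its causes is a LimitReached. */
    static boolean causedByLimit(Throwable t) {
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; t != null && depth < 64; depth++) {
            if (t instanceof LimitReached) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }
}
```

A caller could catch the outer exception, call causedByLimit(e), and return the text collected so far when it is true, rethrowing otherwise.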

tika-2.x - Build # 169 - Still Failing

2016-11-01 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x (build #169)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x/169/ to view the 
results.

[jira] [Commented] (TIKA-2098) Tika.parseToString() with maxLength doesn't work correctly for PDF files

2016-11-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15626388#comment-15626388
 ] 

Hudson commented on TIKA-2098:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1131 (See [https://builds.apache.org/job/Tika-trunk/1131/])
improve test for TIKA-2098 (tallison: rev 2df68c84b043f3158c0bdfa63d1a0c8d44d7e18a)
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java



[jira] [Commented] (TIKA-2152) NullPointerException on a valid Word file

2016-11-01 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15626330#comment-15626330
 ] 

Tim Allison commented on TIKA-2152:
---

https://bz.apache.org/bugzilla/show_bug.cgi?id=60329

> NullPointerException on a valid Word file
> -
>
> Key: TIKA-2152
> URL: https://issues.apache.org/jira/browse/TIKA-2152
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: A5346.docx
>
>
> On the attached Word document, which opens fine in Word, the Tika parser 
> throws the following error:
> java.lang.NullPointerException
>   at org.apache.poi.xwpf.usermodel.XWPFStyles.getStyle(XWPFStyles.java:198)
>   at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:149)
>   at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:107)
>   at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractTable(XWPFWordExtractorDecorator.java:362)
>   at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractHeaderText(XWPFWordExtractorDecorator.java:414)
>   at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractHeaders(XWPFWordExtractorDecorator.java:404)
>   at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:89)
>   at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-2153) TaggedIOException on a valid Powerpoint file

2016-11-01 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2153:


 Summary: TaggedIOException on a valid Powerpoint file
 Key: TIKA-2153
 URL: https://issues.apache.org/jira/browse/TIKA-2153
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


On the following Powerpoint file, which opens fine with Powerpoint:

https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx

the Tika parser throws the following error:

org.apache.tika.io.TaggedIOException: invalid stored block lengths
at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78)
... 13 more
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
... 19 more




Could be similar to #2130.
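
Reading the trace, the failure surfaces while the leading bytes of an embedded part are being read for type detection (handleEmbeddedFile -> parseEmbedded -> detect -> readMagicHeader), and the root cause is a ZipException raised by the inflater. The JDK-only sketch below illustrates one way a caller might keep going when a single embedded stream is unreadable. EmbeddedPartReader, readHeader and describeParts are hypothetical names; whether Tika should behave this way for this file is an open question, not something the report establishes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not Tika or POI code.
class EmbeddedPartReader {

    /** Reads up to `length` leading bytes; returns fewer if the stream ends early. */
    static byte[] readHeader(InputStream in, int length) throws IOException {
        byte[] buf = new byte[length];
        int total = 0;
        while (total < length) {
            int n = in.read(buf, total, length - total);
            if (n == -1) {
                break;
            }
            total += n;
        }
        return Arrays.copyOf(buf, total);
    }

    /** Probes each part in turn and records a failure instead of aborting the whole run. */
    static List<String> describeParts(List<InputStream> parts) {
        List<String> results = new ArrayList<>();
        for (InputStream part : parts) {
            try {
                results.add("ok:" + readHeader(part, 8).length);
            } catch (IOException e) {
                // A corrupt deflate stream surfaces here as java.util.zip.ZipException.
                results.add("failed:" + e.getClass().getSimpleName());
            }
        }
        return results;
    }
}
```

The trade-off is that a per-part catch hides genuine corruption unless the failure is also logged or reported to the caller.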





[jira] [Created] (TIKA-2152) NullPointerException on a valid Word file

2016-11-01 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2152:


 Summary: NullPointerException on a valid Word file
 Key: TIKA-2152
 URL: https://issues.apache.org/jira/browse/TIKA-2152
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
 Attachments: A5346.docx

On the attached Word document, which opens fine in Word, the Tika parser throws 
the following error:

java.lang.NullPointerException
at org.apache.poi.xwpf.usermodel.XWPFStyles.getStyle(XWPFStyles.java:198)
at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:149)
at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:107)
at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractTable(XWPFWordExtractorDecorator.java:362)
at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractHeaderText(XWPFWordExtractorDecorator.java:414)
at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractHeaders(XWPFWordExtractorDecorator.java:404)
at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:89)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Updated] (TIKA-2152) NullPointerException on a valid Word file

2016-11-01 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2152:
-
Attachment: A5346.docx






tika-2.x - Build # 168 - Failure

2016-11-01 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x (build #168)

Status: Failure

Check console output at https://builds.apache.org/job/tika-2.x/168/ to view the 
results.

[jira] [Commented] (TIKA-2151) Imposed Write Limit Causes Lost Data With Pdfs

2016-11-01 Thread Josh Cummings (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15626214#comment-15626214
 ] 

Josh Cummings commented on TIKA-2151:
-

Agreed. When I did my search, I just searched for unresolved issues. Should 
have checked Resolved, too. Thanks!

> Imposed Write Limit Causes Lost Data With Pdfs
> --
>
> Key: TIKA-2151
> URL: https://issues.apache.org/jira/browse/TIKA-2151
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.13
>Reporter: Josh Cummings
>Priority: Critical
>
> When we upgraded to 1.13, we noticed a new exception in our logs:
> org.apache.tika.exception.TikaException: Unable to extract all PDF content
>   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:184)
>   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:144)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.Tika.parseToString(Tika.java:527)
>   at org.apache.tika.Tika.parseToString(Tika.java:602)
>   at com.attask.tika.WriteLimitAllCatchTikaTest.testStillNeedOverride(WriteLimitAllCatchTikaTest.java:31)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
>   at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
>   at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
>   at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:78)
>   at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:212)
>   at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:68)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
> Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to write a string:   One will of mine to make thy large will more.
>   at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:500)
>   at org.apache.pdfbox.text.PDFTextStripper.writeString(PDFTextStripper.java:779)
>   at org.apache.pdfbox.text.PDFTextStripper.writeLine(PDFTextStripper.java:1738)
>   at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:672)
>   at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
>   at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:214)
>   at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>   at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:160)
>   ... 33 more
> Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 10 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.TaggedSAXException: Your document contained more than

[jira] [Commented] (TIKA-2151) Imposed Write Limit Causes Lost Data With Pdfs

2016-11-01 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15626194#comment-15626194
 ] 

Tim Allison commented on TIKA-2151:
---

I think this may be a duplicate of TIKA-2098.  The fix will be in Tika 1.14, 
which should be out towards the end of the week.

I just improved the unit test for TIKA-2098 to be:

{noformat}
@Test
public void testMaxLength() throws Exception {
    InputStream is = getResourceAsStream("/test-documents/testPDF.pdf");
    String content = new Tika().parseToString(is, new Metadata(), 100);
    assertTrue(content.length() == 100);
    assertContains("Tika - Content", content);
}
{noformat}
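
The contract this test checks (exactly maxLength characters come back, beginning with the document's opening text, and no exception reaches the caller) can be pictured with a small JDK-only sketch. LimitedSink and its nested LimitReached are hypothetical names used for illustration; this is not how Tika's WriteOutContentHandler is implemented, just the general shape of a sink that keeps text up to the limit and then signals that it is full.

```java
// Hypothetical illustration only; not Tika code.
class LimitedSink {

    /** Signals that the sink is full; text written before the limit is kept. */
    static class LimitReached extends RuntimeException {
        LimitReached(int limit) {
            super("limit of " + limit + " characters reached");
        }
    }

    private final StringBuilder text = new StringBuilder();
    private final int limit;

    LimitedSink(int limit) {
        this.limit = limit;
    }

    /** Appends the chunk, truncating it and throwing once the limit is exceeded. */
    void write(String chunk) {
        int room = limit - text.length();
        if (chunk.length() > room) {
            text.append(chunk, 0, room);
            throw new LimitReached(limit);
        }
        text.append(chunk);
    }

    String text() {
        return text.toString();
    }

    /** Feeds chunks into a sink and treats "limit reached" as normal termination. */
    static String extract(Iterable<String> chunks, int limit) {
        LimitedSink sink = new LimitedSink(limit);
        try {
            for (String chunk : chunks) {
                sink.write(chunk);
            }
        } catch (LimitReached expected) {
            // Expected outcome: the truncated text is still available from the sink.
        }
        return sink.text();
    }
}
```

The point of the sketch is that only the limit signal is swallowed; any other exception from a producer would still propagate to the caller.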


[jira] [Created] (TIKA-2151) Imposed Write Limit Causes Lost Data With Pdfs

2016-11-01 Thread Josh Cummings (JIRA)
Josh Cummings created TIKA-2151:
---

 Summary: Imposed Write Limit Causes Lost Data With Pdfs
 Key: TIKA-2151
 URL: https://issues.apache.org/jira/browse/TIKA-2151
 Project: Tika
  Issue Type: Bug
  Components: core
Affects Versions: 1.13
Reporter: Josh Cummings
Priority: Critical


When we upgraded to 1.13, we noticed a new exception in our logs:

org.apache.tika.exception.TikaException: Unable to extract all PDF content
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:184)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:144)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.Tika.parseToString(Tika.java:527)
at org.apache.tika.Tika.parseToString(Tika.java:602)
at com.attask.tika.WriteLimitAllCatchTikaTest.testStillNeedOverride(WriteLimitAllCatchTikaTest.java:31)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:78)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:212)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:68)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to write a string:   One will of mine to make thy large will more.
at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:500)
at org.apache.pdfbox.text.PDFTextStripper.writeString(PDFTextStripper.java:779)
at org.apache.pdfbox.text.PDFTextStripper.writeLine(PDFTextStripper.java:1738)
at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:672)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:214)
at 
org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at 
org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:160)
... 33 more
Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more 
than 10 characters, and so your requested limit has been reached. To 
receive the full text of the document, increase your limit. (Text up to the 
limit is however available).
org.apache.tika.sax.TaggedSAXException: Your document contained more than 
10 characters, and so your requested limit has been reached. To receive the 
full text of the document, increase your limit. (Text up to the limit is 
however available).
org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your 
document contained more than 10 characters, and so your requested limit has 
been reached. To receive the full text of the document, increa
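
For anyone reading the trace above: the innermost cause is the write-limit signal, wrapped first in an IOException and then in a TikaException. Below is a minimal sketch of that wrapping pattern in plain Java. It uses no Tika classes; LimitedWriter, LimitReachedException, extract and hasCause are hypothetical names made up for illustration. It shows how a caller could walk the cause chain to tell "limit reached" apart from a real failure while keeping the text collected so far.

```java
import java.io.IOException;

public class WriteLimitSketch {

    /** Hypothetical stand-in for a "write limit reached" signal. */
    static class LimitReachedException extends Exception {
        LimitReachedException(int limit) {
            super("more than " + limit + " characters");
        }
    }

    /** Hypothetical writer that keeps at most `limit` characters. */
    static class LimitedWriter {
        private final StringBuilder out = new StringBuilder();
        private final int limit;

        LimitedWriter(int limit) {
            this.limit = limit;
        }

        void write(String s) throws LimitReachedException {
            int room = limit - out.length();
            if (s.length() > room) {
                out.append(s, 0, room); // keep the text up to the limit
                throw new LimitReachedException(limit);
            }
            out.append(s);
        }

        String text() {
            return out.toString();
        }
    }

    /** Returns true if any exception in the cause chain is of the given type. */
    static boolean hasCause(Throwable t, Class<? extends Throwable> type) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (type.isInstance(c)) {
                return true;
            }
        }
        return false;
    }

    /** Writes each line; an intermediate layer wraps the limit signal, as in the trace. */
    static String extract(LimitedWriter w, String... lines) throws IOException {
        for (String line : lines) {
            try {
                w.write(line);
            } catch (LimitReachedException e) {
                throw new IOException("Unable to write a string: " + line, e);
            }
        }
        return w.text();
    }

    public static void main(String[] args) {
        LimitedWriter w = new LimitedWriter(10);
        try {
            extract(w, "One will ", "of mine");
        } catch (IOException e) {
            if (hasCause(e, LimitReachedException.class)) {
                System.out.println("limit reached, partial text: " + w.text());
            } else {
                System.out.println("real failure: " + e.getMessage());
            }
        }
    }
}
```

Whether the parser itself should do this unwrapping, rather than each caller, is the open question in this issue; the sketch only illustrates the caller-side check.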

[jira] [Commented] (TIKA-2111) Executable Parser adds Content-Type instead of setting

2016-11-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15625783#comment-15625783
 ] 

Hudson commented on TIKA-2111:
--

SUCCESS: Integrated in Jenkins build tika-2.x #167 (See 
[https://builds.apache.org/job/tika-2.x/167/])
TIKA-2111 - ExecutableParser should set rather than add a Content-Type 
(tallison: rev a6978521fb4c75195180d33734ceb23de8b6bd43)
* (edit) 
tika-parser-modules/tika-parser-code-module/src/main/java/org/apache/tika/parser/executable/ExecutableParser.java
* (edit) 
tika-parser-modules/tika-parser-code-module/src/test/java/org/apache/tika/parser/executable/ExecutableParserTest.java


> Executable Parser adds Content-Type instead of setting
> --
>
> Key: TIKA-2111
> URL: https://issues.apache.org/jira/browse/TIKA-2111
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 2.0, 1.15
>
>
> The ExecutableParser {{add}} s {{Content-Type}} instead of setting it.  This 
> can lead to multiple or duplicate {{Content-Type}} s.
> Should probably have asked on the user-list first...Is this the desired 
> behavior?  If not, let's convert {{add()}} to {{set()}}.
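
To make the add-versus-set difference concrete, here is a small sketch in plain Java. MultiValued is a hypothetical multi-valued store written for this example, not Tika's own metadata class, and the media type strings are only sample values. Adding appends to the values already stored under a key, so a second call leaves two Content-Type values, while setting replaces them with one.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AddVersusSet {

    /** Hypothetical multi-valued key/value store, for illustration only. */
    static class MultiValued {
        private final Map<String, List<String>> values = new LinkedHashMap<>();

        /** Appends a value, keeping any values already stored under the key. */
        void add(String name, String value) {
            values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }

        /** Replaces whatever was stored under the key with a single value. */
        void set(String name, String value) {
            List<String> single = new ArrayList<>();
            single.add(value);
            values.put(name, single);
        }

        List<String> getValues(String name) {
            return values.getOrDefault(name, new ArrayList<>());
        }
    }

    public static void main(String[] args) {
        MultiValued added = new MultiValued();
        added.add("Content-Type", "application/octet-stream"); // sample value from an earlier step
        added.add("Content-Type", "application/x-sample");     // a parser adding on top
        System.out.println("after add: " + added.getValues("Content-Type"));

        MultiValued replaced = new MultiValued();
        replaced.add("Content-Type", "application/octet-stream");
        replaced.set("Content-Type", "application/x-sample");   // a parser setting instead
        System.out.println("after set: " + replaced.getValues("Content-Type"));
    }
}
```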



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-2111) Executable Parser adds Content-Type instead of setting

2016-11-01 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2111.
---
   Resolution: Fixed
Fix Version/s: 1.15
   2.0

> Executable Parser adds Content-Type instead of setting
> --
>
> Key: TIKA-2111
> URL: https://issues.apache.org/jira/browse/TIKA-2111
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 2.0, 1.15
>
>
> The ExecutableParser {{add}} s {{Content-Type}} instead of setting it.  This 
> can lead to multiple or duplicate {{Content-Type}} s.
> Should probably have asked on the user-list first...Is this the desired 
> behavior?  If not, let's convert {{add()}} to {{set()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-2111) Executable Parser adds Content-Type instead of setting

2016-11-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15625640#comment-15625640
 ] 

Hudson commented on TIKA-2111:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1130 (See 
[https://builds.apache.org/job/Tika-trunk/1130/])
TIKA-2111 - set instead of add "Content-Type" in the ExecutableParser 
(tallison: rev 15a92302501d5ee6a319442c8109eafe37ec4595)
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/executable/ExecutableParserTest.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/executable/ExecutableParser.java


> Executable Parser adds Content-Type instead of setting
> --
>
> Key: TIKA-2111
> URL: https://issues.apache.org/jira/browse/TIKA-2111
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Trivial
>
> The ExecutableParser {{add}} s {{Content-Type}} instead of setting it.  This 
> can lead to multiple or duplicate {{Content-Type}} s.
> Should probably have asked on the user-list first...Is this the desired 
> behavior?  If not, let's convert {{add()}} to {{set()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-2143) POI deprecated method used in TIKA 1.13

2016-11-01 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15625484#comment-15625484
 ] 

Tim Allison commented on TIKA-2143:
---

Hi [~sbathrutheen], any luck finding an older version of POI on your classpath?
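
One way to check, as a rough sketch: ask the JVM where it loaded the class from. ClasspathLocator and locate are names invented for this example; the default class name is taken from the stack trace in this issue, and a different one can be passed as an argument. Run it with the same classpath as the failing application.

```java
import java.security.CodeSource;

public class ClasspathLocator {

    /** Returns where a class was loaded from, or a marker for JDK/bootstrap classes. */
    static String locate(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "bootstrap or unknown";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "org.apache.poi.hslf.HSLFSlideShow";
        try {
            // Load without initializing, so static initializers cannot interfere.
            Class<?> type = Class.forName(name, false, ClasspathLocator.class.getClassLoader());
            System.out.println(name + " loaded from " + locate(type));
        } catch (ClassNotFoundException e) {
            System.out.println(name + " is not on the classpath");
        }
    }
}
```

If the printed location is a jar from an older POI release than the one your Tika version expects, that would point to a classpath conflict.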

> POI deprecated method used in TIKA 1.13 
> 
>
> Key: TIKA-2143
> URL: https://issues.apache.org/jira/browse/TIKA-2143
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9, 1.13
> Environment: Windows java application
>Reporter: sbathrutheen
>Priority: Trivial
> Fix For: 1.13
>
>
> We see that TIKA throws a long list of errors when extracting ppt files. When we 
> tested with the standalone Tika application (1.13), we could not reproduce the issue.
> We took a look at the POI source code and observed that the class "HSLFSlideShow" 
> defines the deprecated method below: 
> *
> /**
> -  * Get the lookup from slide numbers to their offsets inside
> -  *  _ptrData, used when adding or moving slides.
> -  * 
> -  * @deprecated since POI 3.11, not supported anymore
> -  */
> - @Deprecated
> - public Hashtable getSlideOffsetDataLocationsLookup() {
> - throw new 
> UnsupportedOperationException("PersistPtrHolder.getSlideOffsetDataLocationsLookup()
>  is not supported since 3.12-Beta1");
> - }
> *
> We think the Tika library may still be calling this deprecated method, causing this 
> runtime exception:
> Caused by: org.apache.tika.exception.TikaException: Unexpected 
> RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@204c3b78
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at 
> com.searchtechnologies.aspire.docprocessing.extracttext.ExtractTextStage.process(ExtractTextStage.java:140)
> ... 14 more
> Caused by: java.lang.UnsupportedOperationException
> at java.util.AbstractMap$SimpleImmutableEntry.setValue(Unknown Source)
> at org.apache.poi.hslf.HSLFSlideShow.read(HSLFSlideShow.java:293)
> at org.apache.poi.hslf.HSLFSlideShow.buildRecords(HSLFSlideShow.java:273)
> at org.apache.poi.hslf.HSLFSlideShow.<init>(HSLFSlideShow.java:188)
> at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:61)
> at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
> at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
> ... 17 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-2056) Installing exiftool causes ForkParserIntegration test errors

2016-11-01 Thread Konstantin Gribov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15625287#comment-15625287
 ] 

Konstantin Gribov commented on TIKA-2056:
-

[~chrismattmann], I set "fix versions" to 1.15 in case you don't roll a 
new RC. If you do, I'll update it.

> Installing exiftool causes ForkParserIntegration test errors
> 
>
> Key: TIKA-2056
> URL: https://issues.apache.org/jira/browse/TIKA-2056
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
>Reporter: Chris A. Mattmann
>Assignee: Konstantin Gribov
> Fix For: 1.15
>
>
> [~rgauss] maybe you can help me with this. For some reason when I was trying 
> your PR, I got all sorts of weird errors that I thought had to do with your 
> PR, but in fact, had to do with Fork Parser Integration test. [~kkrugler] 
> I've seen you've contributed to the Fork parser tests so tagging you on this 
> too. Any reason you guys can think of that exiftool causes the Fork parser 
> integration tests to fail?
> Here's the log msg (that I thought was due to the Sentiment parser, but is in 
> fact not!):
> {noformat}
> [INFO] Changes detected - recompiling the module!
> [INFO] Compiling 124 source files to 
> /Users/mattmann/tmp/tika1.14/tika-parsers/target/test-classes
> [INFO] 
> /Users/mattmann/tmp/tika1.14/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java:
>  Some input files use or override a deprecated API.
> [INFO] 
> /Users/mattmann/tmp/tika1.14/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java:
>  Recompile with -Xlint:deprecation for details.
> [INFO] 
> [INFO] --- maven-surefire-plugin:2.18.1:test (default-test) @ tika-parsers ---
> [INFO] Surefire report directory: 
> /Users/mattmann/tmp/tika1.14/tika-parsers/target/surefire-reports
> ---
>  T E S T S
> ---
> Running org.apache.tika.parser.fork.ForkParserIntegrationTest
> Tests run: 5, Failures: 1, Errors: 3, Skipped: 0, Time elapsed: 2.46 sec <<< 
> FAILURE! - in org.apache.tika.parser.fork.ForkParserIntegrationTest
> testForkedTextParsing(org.apache.tika.parser.fork.ForkParserIntegrationTest)  
> Time elapsed: 0.185 sec  <<< ERROR!
> org.apache.tika.exception.TikaException: Unable to serialize AutoDetectParser 
> to pass to the Forked Parser
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at java.util.ArrayList.writeObject(ArrayList.java:762)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at java.util.ArrayList.writeObject(ArrayList.java:762)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOut
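
A note on reading the trace above: it runs through ArrayList.writeObject and ends inside ObjectOutputStream.writeObject0, which is consistent with a non-serializable element somewhere in a nested list. The truncated log does not show which class it was, so the sketch below is only an illustration in plain Java, with hypothetical Holder and Helper classes, of how such a failure arises and how the offending class name can be read from the exception.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class SerializationSketch {

    /** Hypothetical component that does not implement Serializable. */
    static class Helper { }

    /** Hypothetical parser-like object holding a list of components. */
    static class Holder implements Serializable {
        private static final long serialVersionUID = 1L;
        final List<Object> parts = new ArrayList<>();
    }

    /** Returns the name of the class that blocked serialization, or null on success. */
    static String firstNonSerializable(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return null;
        } catch (NotSerializableException e) {
            return e.getMessage(); // by default the message is the offending class name
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Holder ok = new Holder();
        ok.parts.add("plain strings serialize fine");
        System.out.println("ok holder blocked by: " + firstNonSerializable(ok));

        Holder bad = new Holder();
        bad.parts.add(new Helper()); // nested element that is not Serializable
        System.out.println("bad holder blocked by: " + firstNonSerializable(bad));
    }
}
```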

[jira] [Resolved] (TIKA-2056) Installing exiftool causes ForkParserIntegration test errors

2016-11-01 Thread Konstantin Gribov (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov resolved TIKA-2056.
-
   Resolution: Fixed
Fix Version/s: 1.15

> Installing exiftool causes ForkParserIntegration test errors
> 
>
> Key: TIKA-2056
> URL: https://issues.apache.org/jira/browse/TIKA-2056
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
>Reporter: Chris A. Mattmann
>Assignee: Konstantin Gribov
> Fix For: 1.15
>
>
> [~rgauss] maybe you can help me with this. For some reason when I was trying 
> your PR, I got all sorts of weird errors that I thought had to do with your 
> PR, but in fact, had to do with Fork Parser Integration test. [~kkrugler] 
> I've seen you've contributed to the Fork parser tests so tagging you on this 
> too. Any reason you guys can think of that exiftool causes the Fork parser 
> integration tests to fail?
> Here's the log msg (that I thought was due to the Sentiment parser, but is in 
> fact not!):
> {noformat}
> [INFO] Changes detected - recompiling the module!
> [INFO] Compiling 124 source files to 
> /Users/mattmann/tmp/tika1.14/tika-parsers/target/test-classes
> [INFO] 
> /Users/mattmann/tmp/tika1.14/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java:
>  Some input files use or override a deprecated API.
> [INFO] 
> /Users/mattmann/tmp/tika1.14/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java:
>  Recompile with -Xlint:deprecation for details.
> [INFO] 
> [INFO] --- maven-surefire-plugin:2.18.1:test (default-test) @ tika-parsers ---
> [INFO] Surefire report directory: 
> /Users/mattmann/tmp/tika1.14/tika-parsers/target/surefire-reports
> ---
>  T E S T S
> ---
> Running org.apache.tika.parser.fork.ForkParserIntegrationTest
> Tests run: 5, Failures: 1, Errors: 3, Skipped: 0, Time elapsed: 2.46 sec <<< 
> FAILURE! - in org.apache.tika.parser.fork.ForkParserIntegrationTest
> testForkedTextParsing(org.apache.tika.parser.fork.ForkParserIntegrationTest)  
> Time elapsed: 0.185 sec  <<< ERROR!
> org.apache.tika.exception.TikaException: Unable to serialize AutoDetectParser 
> to pass to the Forked Parser
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at java.util.ArrayList.writeObject(ArrayList.java:762)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at java.util.ArrayList.writeObject(ArrayList.java:762)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutp