[ https://issues.apache.org/jira/browse/TIKA-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-2098.
-------------------------------
Resolution: Fixed
Assignee: Tim Allison
Fix Version/s: 1.14, 2.0
Good catch. Thank you!
> Tika.parseToString() with maxLength doesn't work correctly for PDF files
> ------------------------------------------------------------------------
>
> Key: TIKA-2098
> URL: https://issues.apache.org/jira/browse/TIKA-2098
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Reporter: Alexander Kazakov
> Assignee: Tim Allison
> Labels: java, parser, pdf
> Fix For: 2.0, 1.14
>
>
> When parsing a PDF file with Tika.parseToString(InputStream stream, Metadata
> metadata, int maxLength) and maxLength is smaller than the content size, an
> exception is thrown instead of the truncated text being returned:
> {code:java}
> org.apache.tika.exception.TikaException: Unable to extract all PDF content
> at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:135)
> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:150)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at org.apache.tika.Tika.parseToString(Tika.java:568)
> Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to write a string: Tika - Content Analysis Toolkit
> at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:302)
> at org.apache.pdfbox.text.PDFTextStripper.writeString(PDFTextStripper.java:779)
> at org.apache.pdfbox.text.PDFTextStripper.writeLine(PDFTextStripper.java:1738)
> at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:672)
> at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
> at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:143)
> at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111)
> ... 35 more
> Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
> at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
> at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
> at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
> at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
> at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
> at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:279)
> at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:306)
> at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:300)
> ... 43 more
> Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
> at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
> at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> ... 51 more
> Caused by: org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:141)
> at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
> at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
> at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> ... 52 more
> {code}
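
For readers following the trace: the innermost cause is the write-limit signal, and each layer above it wraps that signal in a new exception, so by the time it reaches the caller it no longer looks like a plain "limit reached" condition. The sketch below illustrates that pattern using only the JDK. It is not Tika's code and not the fix that was committed; WriteLimitSketch, LimitedSink, LimitReachedException, parse and causedByLimit are hypothetical names made up for this illustration. It assumes the caller can recognise the limit signal by walking the cause chain and then return the text collected so far.

```java
class WriteLimitSketch {

    /** Hypothetical stand-in for a "write limit reached" signal. */
    static class LimitReachedException extends RuntimeException {
        LimitReachedException(int limit) {
            super("Your document contained more than " + limit + " characters");
        }
    }

    /** Collects text and throws once the limit would be exceeded. */
    static class LimitedSink {
        private final StringBuilder out = new StringBuilder();
        private final int limit;

        LimitedSink(int limit) {
            this.limit = limit;
        }

        void write(String s) {
            int room = limit - out.length();
            if (s.length() > room) {
                out.append(s, 0, room); // keep the text up to the limit
                throw new LimitReachedException(limit);
            }
            out.append(s);
        }

        String text() {
            return out.toString();
        }
    }

    /** Hypothetical parser layers: each one wraps whatever failed below it. */
    static void parse(String[] chunks, LimitedSink sink) {
        try {
            for (String chunk : chunks) {
                try {
                    sink.write(chunk);
                } catch (RuntimeException e) {
                    throw new RuntimeException("Unable to write a string: " + chunk, e);
                }
            }
        } catch (RuntimeException e) {
            throw new IllegalStateException("Unable to extract all content", e);
        }
    }

    /** True if any exception in the cause chain is the limit signal. */
    static boolean causedByLimit(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof LimitReachedException) {
                return true;
            }
        }
        return false;
    }

    /** Returns at most maxLength characters; rethrows any other failure. */
    static String parseToString(String[] chunks, int maxLength) {
        LimitedSink sink = new LimitedSink(maxLength);
        try {
            parse(chunks, sink);
        } catch (RuntimeException e) {
            if (!causedByLimit(e)) {
                throw e; // a real parse failure, not the limit
            }
        }
        return sink.text();
    }
}
```

With this sketch, WriteLimitSketch.parseToString(new String[] {"Tika - ", "Content Analysis Toolkit"}, 10) returns the first 10 characters rather than propagating the wrapped exception. A check that only looks at the outermost exception type would miss the limit signal, which is the kind of failure the trace above shows.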