Alexander Kazakov created TIKA-2098:
---------------------------------------
Summary: Tika.parseToString() with maxLength doesn't work
correctly for PDF files
Key: TIKA-2098
URL: https://issues.apache.org/jira/browse/TIKA-2098
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.13
Reporter: Alexander Kazakov
When parsing a PDF file with Tika.parseToString(InputStream stream, Metadata
metadata, int maxLength) and maxLength is less than the content size, it throws
the following exception:
{code:java}
org.apache.tika.exception.TikaException: Unable to extract all PDF content
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:135)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:150)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.Tika.parseToString(Tika.java:568)
Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to write a string: Tika - Content Analysis Toolkit
at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:302)
at org.apache.pdfbox.text.PDFTextStripper.writeString(PDFTextStripper.java:779)
at org.apache.pdfbox.text.PDFTextStripper.writeLine(PDFTextStripper.java:1738)
at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:672)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:143)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111)
... 35 more
Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:279)
at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:306)
at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:300)
... 43 more
Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
... 51 more
Caused by: org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:141)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
... 52 more
{code}
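The trace shows the WriteLimitReachedException buried under three wrappers (TaggedSAXException, IOExceptionWithCause, TikaException) by the time it reaches Tika.parseToString(). Below is a minimal stand-alone sketch of how a caller could walk the cause chain to recognize the limit and still return the truncated text. It uses only the JDK; every class and method name in it (WriteLimitSketch, LimitReachedException, LimitedSink, isLimitReached) is a hypothetical stand-in, not Tika's actual code, and whether Tika's own check behaves this way is an assumption, not something this report confirms.

```java
import java.io.IOException;

public class WriteLimitSketch {

    /** Hypothetical stand-in for Tika's WriteLimitReachedException. */
    static class LimitReachedException extends RuntimeException {
        LimitReachedException(int limit) {
            super("Your document contained more than " + limit + " characters");
        }
    }

    /** Collects text and throws as soon as the character limit is exceeded. */
    static class LimitedSink {
        private final StringBuilder text = new StringBuilder();
        private final int limit;

        LimitedSink(int limit) {
            this.limit = limit;
        }

        void characters(String s) {
            int room = limit - text.length();
            if (s.length() > room) {
                // Keep the text up to the limit, then signal that it was reached.
                text.append(s, 0, room);
                throw new LimitReachedException(limit);
            }
            text.append(s);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Hypothetical parser that wraps a sink failure twice, mimicking the trace. */
    static void parse(String content, LimitedSink sink) throws Exception {
        try {
            try {
                sink.characters(content);
            } catch (RuntimeException e) {
                throw new IOException("Unable to write a string: " + content, e);
            }
        } catch (IOException e) {
            throw new Exception("Unable to extract all PDF content", e);
        }
    }

    /** Walks the whole cause chain instead of inspecting only the outermost exception. */
    static boolean isLimitReached(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof LimitReachedException) {
                return true;
            }
        }
        return false;
    }

    /** Returns the text up to maxLength; rethrows anything that is not a limit signal. */
    static String parseToString(String content, int maxLength) throws Exception {
        LimitedSink sink = new LimitedSink(maxLength);
        try {
            parse(content, sink);
        } catch (Exception e) {
            if (!isLimitReached(e)) {
                throw e;
            }
        }
        return sink.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseToString("Tika - Content Analysis Toolkit", 10));
    }
}
```

In this sketch the limit signal survives any number of wrapping layers because the check follows getCause() to the root rather than relying on the type of the outermost exception.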