PDFTextStripper doesn't process text annotations
------------------------------------------------
Key: PDFBOX-1143
URL: https://issues.apache.org/jira/browse/PDFBOX-1143
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Reporter: Michael McCandless
Priority: Minor
Users are able to add annotations (comments) to a PDF, and PDFBox
processes them correctly: you can retrieve them via
PDPage.getAnnotations.
However, PDFTextStripper currently does not extract the text from
annotations.
I think it should, at least optionally.
One approach: add a boolean option (shouldProcessAnnotations?). When
it is enabled, we'd visit the annotations that have sub-type FreeText,
extract what text we can (Subject, TitlePopup, Contents, maybe
RichContents?), pair that text with the annotation's rectangle
(.getRectangle) to make a TextPosition, and then somehow associate
that with the right "article", so that annotations "over" a given
article are rendered with it.
Alternatively, we could just put all annotations into their own "article".
I'm not familiar enough with PDF text positioning or PDFTextStripper
to work out a real patch here, but I think this approach should
work.
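
To make the proposal a bit more concrete, here is a rough sketch of the
"associate the annotation with the right article" step, with the "own
article" idea as a fallback. It deliberately does not use PDFBox: Rect,
Note, assign and the "annotations" bucket name are all hypothetical
stand-ins, and picking the article by largest overlapping area is just
one assumed way to decide what "over" means. Only the flag name
shouldProcessAnnotations and the FreeText sub-type come from the
description above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types are PDFBox classes.
class AnnotationArticleSketch {

    /** Axis-aligned rectangle in page coordinates (stand-in for annotation or article bounds). */
    static final class Rect {
        final double x1, y1, x2, y2;

        Rect(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }

        /** Area shared with another rectangle, or 0 if they do not overlap. */
        double overlapArea(Rect other) {
            double w = Math.min(x2, other.x2) - Math.max(x1, other.x1);
            double h = Math.min(y2, other.y2) - Math.max(y1, other.y1);
            return (w > 0 && h > 0) ? w * h : 0.0;
        }
    }

    /** Minimal stand-in for an annotation: sub-type, text contents, rectangle. */
    static final class Note {
        final String subtype;
        final String contents;
        final Rect rect;

        Note(String subtype, String contents, Rect rect) {
            this.subtype = subtype; this.contents = contents; this.rect = rect;
        }
    }

    /** Assumed name of the catch-all bucket for annotations that sit over no article. */
    static final String OWN_ARTICLE = "annotations";

    /**
     * Groups the text of FreeText notes by the article whose bounds they overlap most.
     * Notes that overlap no article go into OWN_ARTICLE. Returns an empty map when
     * the flag is off, so the default extraction behaviour would be unchanged.
     */
    static Map<String, List<String>> assign(Map<String, Rect> articles,
                                            List<Note> notes,
                                            boolean shouldProcessAnnotations) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        if (!shouldProcessAnnotations) {
            return result;
        }
        for (Note note : notes) {
            // Only FreeText annotations that actually carry text are considered.
            if (!"FreeText".equals(note.subtype)
                    || note.contents == null || note.contents.isEmpty()) {
                continue;
            }
            String best = OWN_ARTICLE;
            double bestArea = 0.0;
            for (Map.Entry<String, Rect> article : articles.entrySet()) {
                double area = note.rect.overlapArea(article.getValue());
                if (area > bestArea) {
                    bestArea = area;
                    best = article.getKey();
                }
            }
            result.computeIfAbsent(best, k -> new ArrayList<>()).add(note.contents);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Rect> articles = new LinkedHashMap<>();
        articles.put("body", new Rect(50, 50, 550, 750));
        List<Note> notes = new ArrayList<>();
        notes.add(new Note("FreeText", "Check this figure", new Rect(100, 100, 200, 150)));
        notes.add(new Note("FreeText", "Orphan note", new Rect(600, 10, 650, 40)));
        System.out.println(assign(articles, notes, true));
    }
}
```

In a real patch the rectangles would presumably come from the
annotation's .getRectangle and from whatever bounds PDFTextStripper
tracks per article, and the grouped text would still have to be turned
into TextPosition objects; the sketch leaves both of those open.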