rey bernal created PDFBOX-3079:
----------------------------------
Summary: Extracting text between bookmarks not working
Key: PDFBOX-3079
URL: https://issues.apache.org/jira/browse/PDFBOX-3079
Project: PDFBox
Issue Type: Improvement
Components: Text extraction
Affects Versions: 2.0.0
Environment: Windows
Reporter: rey bernal
Priority: Critical
Fix For: 2.0.0
org.apache.pdfbox.text.PDFTextStripper does not really support extraction of
content between bookmarks. From looking at the code in
pdfbox-parent/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java
it is clear that it uses the bookmarks that the user provided to determine the
pages to extract content from.
There is a business need to extract the text that lies strictly between
bookmarks. Refer to the attached example program and sample file.
Extracting any of the sections on the first page returns the entire first
page instead of only the content under that bookmark.
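
To make the difference concrete, here is a minimal sketch of page-level
selection versus selection strictly between two bookmark positions. It is not
PDFBox code: the Line and Mark classes and both methods are hypothetical, and
it assumes each text line and each bookmark destination can be reduced to a
page index plus a distance from the top of that page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BookmarkRangeSketch {
    /** Hypothetical model of one text line: page index, distance from page top, content. */
    public static final class Line {
        final int page; final float top; final String text;
        public Line(int page, float top, String text) {
            this.page = page; this.top = top; this.text = text;
        }
    }

    /** Hypothetical model of a bookmark destination: page index and distance from page top. */
    public static final class Mark {
        final int page; final float top;
        public Mark(int page, float top) { this.page = page; this.top = top; }
    }

    /** True when the line sits at or below the bookmark position in reading order. */
    static boolean atOrAfter(Line l, Mark m) {
        return l.page > m.page || (l.page == m.page && l.top >= m.top);
    }

    /** Page-level selection: keeps every line on the pages the two bookmarks point to. */
    public static List<String> byPage(List<Line> lines, Mark start, Mark end) {
        List<String> out = new ArrayList<>();
        for (Line l : lines) {
            if (l.page >= start.page && l.page <= end.page) out.add(l.text);
        }
        return out;
    }

    /** Position-level selection: keeps lines from the start bookmark up to, not including, the end. */
    public static List<String> strictlyBetween(List<Line> lines, Mark start, Mark end) {
        List<String> out = new ArrayList<>();
        for (Line l : lines) {
            if (atOrAfter(l, start) && !atOrAfter(l, end)) out.add(l.text);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Line> lines = Arrays.asList(
            new Line(0, 100f, "Section A title"), new Line(0, 150f, "A body"),
            new Line(0, 300f, "Section B title"), new Line(0, 350f, "B body"),
            new Line(1, 100f, "Section C title"));
        Mark a = new Mark(0, 100f);
        Mark b = new Mark(0, 300f);
        System.out.println(byPage(lines, a, b));          // whole first page
        System.out.println(strictlyBetween(lines, a, b)); // section A only
    }
}
```

With both bookmarks on the first page, byPage returns every line of that page,
which matches the behaviour reported here, while strictlyBetween returns only
the lines of the first section.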