[
https://issues.apache.org/jira/browse/PDFBOX-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14986752#comment-14986752
]
rey bernal commented on PDFBOX-3079:
------------------------------------
Although there could be difficulties because not all types of bookmark
destinations may work, I should point out that this specific scenario is very
common, particularly with PDF documents that are generated from Word documents.
In this case, the Word document's table of contents is used during the
Word-to-PDF conversion to create the PDF bookmarks. So there is a vast number of
use cases where this will work, even if only for PDF documents generated from Word.
Is there a way to use other information from the bookmark to narrow down
the correct location of the starting text? For example, we know the bookmark
text (getTitle()); can we not use the title to help identify the
starting point?
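To illustrate the idea, here is a minimal sketch that works on plain strings
only. It assumes the page text has already been extracted (for example with
PDFTextStripper) and that the bookmark titles have been collected via
getTitle(). The class name BookmarkSections and the method splitByTitles are
hypothetical names for this example, not part of PDFBox. It also assumes each
title appears verbatim in the extracted text, in bookmark order, which may not
hold for every document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: slices already-extracted text
// into sections using the bookmark titles as markers.
class BookmarkSections {

    /**
     * Returns a map from bookmark title to the text that follows that title,
     * up to the next title found (or the end of the text for the last one).
     * Titles that do not occur in the text are skipped.
     */
    public static Map<String, String> splitByTitles(String text, List<String> titles) {
        List<String> found = new ArrayList<>();
        List<Integer> starts = new ArrayList<>();
        int from = 0;
        for (String title : titles) {
            // Search only after the previous match so sections stay in order.
            int idx = text.indexOf(title, from);
            if (idx < 0) {
                continue; // assumption violated: title not present verbatim
            }
            found.add(title);
            starts.add(idx);
            from = idx + title.length();
        }

        Map<String, String> sections = new LinkedHashMap<>();
        for (int i = 0; i < found.size(); i++) {
            int bodyStart = starts.get(i) + found.get(i).length();
            int bodyEnd = (i + 1 < starts.size()) ? starts.get(i + 1) : text.length();
            sections.put(found.get(i), text.substring(bodyStart, bodyEnd).trim());
        }
        return sections;
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a single page.
        String pageText = "Introduction\nFirst section body.\nScope\nSecond section body.\n";
        Map<String, String> sections =
                splitByTitles(pageText, Arrays.asList("Introduction", "Scope"));
        for (Map.Entry<String, String> e : sections.entrySet()) {
            System.out.println("[" + e.getKey() + "] " + e.getValue());
        }
    }
}
```

A title that also occurs earlier in the text (for instance in a table of
contents on the same page) would be matched at the wrong place, so this is
only a starting point for discussion.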
> Extracting text between bookmarks not working
> ---------------------------------------------
>
> Key: PDFBOX-3079
> URL: https://issues.apache.org/jira/browse/PDFBOX-3079
> Project: PDFBox
> Issue Type: Improvement
> Components: Text extraction
> Affects Versions: 2.0.0
> Environment: Windows
> Reporter: rey bernal
> Labels: textextraction
> Attachments: Test.java, test.pdf
>
>
> org.apache.pdfbox.text.PDFTextStripper does not really support extraction of
> content between bookmarks. From looking at the code in
> pdfbox-parent/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java
> it is clear that it uses the bookmarks that the user provided to determine
> the pages to extract content from.
> There is a business need to extract the text that lies strictly between
> bookmarks. Refer to the attached example program and sample file.
> Extracting each of the sections on the first page returns the entire first
> page instead of the content under each bookmark.