[jira] [Commented] (PDFBOX-3079) Extracting text between bookmarks not working

Tilman Hausherr (JIRA) Sun, 01 Nov 2015 15:07:16 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984589#comment-14984589
 ]


Tilman Hausherr commented on PDFBOX-3079:
-----------------------------------------

There are 8 different types of bookmark destinations (Table 151 in the ISO 32000 
spec). Of them, only some would qualify for your idea. Even then, you don't know 
for sure that the coordinates are correct, i.e. that they really help to 
exactly mark the part you want to extract.
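
As a rough illustration of the point about destination types, here is a minimal 
Java sketch that lists the eight explicit destination types from Table 151 and 
marks which of them carry a left or top coordinate at all. The class, enum, and 
method names are made up for this example and are not PDFBox API. Even for the 
types that do carry a top value, the value may be null in a given file, and it 
describes where the viewer should scroll to, not where a section of content 
begins or ends.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: models the destination
// syntax from Table 151 of ISO 32000-1.
class DestinationTypes {

    /** The eight explicit destination types and the coordinates each one carries. */
    enum DestType {
        XYZ(true, true),     // [page /XYZ left top zoom]
        FIT(false, false),   // [page /Fit]
        FIT_H(false, true),  // [page /FitH top]
        FIT_V(true, false),  // [page /FitV left]
        FIT_R(true, true),   // [page /FitR left bottom right top]
        FIT_B(false, false), // [page /FitB]
        FIT_BH(false, true), // [page /FitBH top]
        FIT_BV(true, false); // [page /FitBV left]

        final boolean hasLeft;
        final boolean hasTop;

        DestType(boolean hasLeft, boolean hasTop) {
            this.hasLeft = hasLeft;
            this.hasTop = hasTop;
        }
    }

    /** Returns the types that name a vertical position on the page. */
    static List<DestType> withVerticalPosition() {
        List<DestType> result = new ArrayList<>();
        for (DestType type : DestType.values()) {
            if (type.hasTop) {
                result.add(type);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Only these types could even in principle act as a vertical
        // marker between two bookmarks on the same page.
        for (DestType type : withVerticalPosition()) {
            System.out.println(type + " left=" + type.hasLeft + " top=" + type.hasTop);
        }
    }
}
```

Only half of the types name a vertical position, which is why a bookmark that 
uses Fit, FitV, FitB or FitBV can identify a page but nothing finer than that.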

To understand what I mean, look at the PDF file with the 2.0 version of 
PDFDebugger (click on "show internal structure") and start here: 
{{Root/Outlines/First/Dest}}

> Extracting text between bookmarks not working
> ---------------------------------------------
>
>                 Key: PDFBOX-3079
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3079
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.0
>         Environment: Windows
>            Reporter: rey bernal
>              Labels: textextraction
>         Attachments: Test.java, test.pdf
>
>
> org.apache.pdfbox.text.PDFTextStripper does not really support extraction of 
> content between bookmarks. From looking at the code in 
> pdfbox-parent/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java
> it is clear that it is using the bookmarks that the user provided to determine 
> the pages to extract content from.
> There is a business need to extract the text that lies strictly between 
> bookmarks. Refer to the attached example program and sample file.
> The extractions for the sections on the first page all return the entire first 
> page instead of the content inside each bookmark.


