[jira] [Commented] (PDFBOX-3079) Extracting text between bookmarks not working

Tilman Hausherr (JIRA) Mon, 02 Nov 2015 22:57:13 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14986801#comment-14986801
 ]


Tilman Hausherr commented on PDFBOX-3079:
-----------------------------------------

Have you looked at PDFDebugger? You can use the destination parameters to 
build the area bounds. For the destination I mentioned, the contents are 
Page 2, XYZ, 69, 701, 0 (page, type, left, top, zoom), so the coordinates are 
x=69 and y=701.
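
As a rough sketch of the idea above, the top coordinates of two consecutive bookmark destinations can be turned into a rectangle covering the band of the page between them. Everything here is illustrative: the class and method names are hypothetical, the 612 x 792 page size and the second bookmark's top value of 500 are assumptions, and the code assumes that area-based extraction expects y measured downward from the top of the page, while XYZ destinations measure y upward from the bottom. It also assumes an unrotated page with no crop box offset.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox: builds the area bounds
// between two bookmark destinations on the same page.
final class BookmarkRegion {

    private BookmarkRegion() {
    }

    /**
     * @param pageWidth  page width in PDF units (assumed known, e.g. from the media box)
     * @param pageHeight page height in PDF units
     * @param startTop   "top" value of the XYZ destination that starts the section
     * @param endTop     "top" value of the next bookmark's destination (lower on the page)
     * @return rectangle with a top-left origin spanning the full page width
     */
    static Rectangle2D between(float pageWidth, float pageHeight,
                               float startTop, float endTop) {
        if (endTop > startTop) {
            throw new IllegalArgumentException("endTop must not be above startTop");
        }
        // Flip the y axis: PDF y grows upward, the rectangle's y grows downward.
        float y = pageHeight - startTop;
        float height = startTop - endTop;
        return new Rectangle2D.Float(0f, y, pageWidth, height);
    }

    public static void main(String[] args) {
        // 701 is the top value from the destination above; 500 is an assumed
        // top value for the following bookmark.
        Rectangle2D r = between(612f, 792f, 701f, 500f);
        System.out.println("x=" + r.getX() + " y=" + r.getY()
                + " w=" + r.getWidth() + " h=" + r.getHeight());
    }
}
```

With PDFBox on the classpath, a rectangle like this could then be passed to PDFTextStripperByArea.addRegion(name, rect), followed by extractRegions(page) and getTextForRegion(name). If the sections share a horizontal band, the left value (69 here) could narrow the rectangle instead of using the full page width.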

> Extracting text between bookmarks not working
> ---------------------------------------------
>
>                 Key: PDFBOX-3079
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3079
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.0
>         Environment: Windows
>            Reporter: rey bernal
>              Labels: textextraction
>         Attachments: Test.java, test.pdf
>
>
> org.apache.pdfbox.text.PDFTextStripper does not really support extraction of 
> content between bookmarks. From looking at the code in 
> pdfbox-parent/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java
> it is clear that it uses the bookmarks the user provided only to determine 
> the pages to extract content from.
> There is a business need to extract the text that lies strictly between 
> bookmarks. Refer to the attached example program and sample file.
> Extracting the sections on the first page returns the entire first page 
> each time instead of the content inside each bookmark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
