[jira] [Commented] (PDFBOX-2792) Text extraction ignores bookmarks

ASF subversion and git services (JIRA) Thu, 14 May 2015 15:21:07 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544426#comment-14544426
 ]


ASF subversion and git services commented on PDFBOX-2792:
---------------------------------------------------------

Commit 1679464 from [~tilman] in branch 'pdfbox/branches/1.8'
[ https://svn.apache.org/r1679464 ]

PDFBOX-2792: remove CRs for equality test

> Text extraction ignores bookmarks
> ---------------------------------
>
>                 Key: PDFBOX-2792
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2792
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.9, 2.0.0
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>              Labels: regression
>             Fix For: 1.8.10, 2.0.0
>
>
> As reported by Noam S. on the user mailing list:
> {quote}
> My problem is that when trying to getText(doc) from a certain section of the 
> pdf using setStartBookmark(item) and setEndBookmark(item) I get all the text 
> rather than just the text from the specified section.
> While trying to resolve this I realized that the writeText(doc, outputStream) 
> method always calls resetEngine() method. That will reset all the parameters 
> and delete the bookmarks I set.
> {quote}
> The two lines that reset the bookmarks were added to resetEngine in 
> PDFBOX-1808 in [ https://svn.apache.org/r1553175 ] in an attempt to save some 
> memory.
> I also found another weird piece of code in the trunk, which would mean that 
> text extraction would fail if start and end bookmarks are identical:
> {code}
>         if (startPage != null && endPage != null &&
>             startBookmark.getCOSObject() == endBookmark.getCOSObject())
>         {
>             // this is a special case where both the start and end bookmark
>             // are the same but point to nothing.  In this case
>             // we will not extract any text.
>             startBookmarkPageNumber = 0;
>             endBookmarkPageNumber = 0;
>         }
> {code}
> Earlier, that segment was:
> {code}
>        if( startBookmarkPageNumber == -1 && startBookmark != null &&
>                 endBookmarkPageNumber == -1 && endBookmark != null &&
>                 startBookmark.getCOSObject() == endBookmark.getCOSObject() )
>         {
>             //this is a special case where both the start and end bookmark
>             //are the same but point to nothing.  In this case
>             //we will not extract any text.
>             startBookmarkPageNumber = 0;
>             endBookmarkPageNumber = 0;
>         }
> {code}
> which makes more sense. The change was made last year in rev [ 
> https://svn.apache.org/r1634252 ] as part of the pagetree refactoring.
> I am writing a test to prevent this from breaking in the future.
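
To make the difference between the two quoted conditions concrete, here is a minimal, self-contained sketch. It is not PDFBox code: the class name BookmarkSpecialCase and the methods earlierCheck and trunkCheck are hypothetical, bookmarks and pages are modelled as plain Objects, and the getCOSObject() comparison is simplified to a reference comparison. It only illustrates why the trunk condition fires when the same bookmark is used as start and end and resolves to a real page, whereas the earlier condition fires only when both bookmarks point to nothing.

```java
public class BookmarkSpecialCase {

    // Simplified model of the earlier condition: both page numbers are
    // still unresolved (-1), both bookmarks are set, and they are the
    // same object.
    public static boolean earlierCheck(int startPageNumber, Object startBookmark,
                                       int endPageNumber, Object endBookmark) {
        return startPageNumber == -1 && startBookmark != null
                && endPageNumber == -1 && endBookmark != null
                && startBookmark == endBookmark;
    }

    // Simplified model of the trunk condition as quoted: both pages were
    // resolved (non-null) and the bookmarks are the same object.
    public static boolean trunkCheck(Object startPage, Object endPage,
                                     Object startBookmark, Object endBookmark) {
        return startPage != null && endPage != null
                && startBookmark == endBookmark;
    }

    public static void main(String[] args) {
        Object bookmark = new Object(); // same bookmark used as start and end
        Object page = new Object();     // the page that bookmark resolves to

        // The bookmark resolves to a real page (page 3 in this made-up case),
        // so the "points to nothing" special case should not apply.
        System.out.println(earlierCheck(3, bookmark, 3, bookmark));     // prints false
        // The trunk condition applies anyway, which would zero the page
        // range and suppress extraction.
        System.out.println(trunkCheck(page, page, bookmark, bookmark)); // prints true
    }
}
```

In this model, a regression test along the lines mentioned above would use one bookmark as both start and end and check that the special case is not triggered when that bookmark resolves to a page.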



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
