[ https://issues.apache.org/jira/browse/PDFBOX-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544426#comment-14544426 ]
ASF subversion and git services commented on PDFBOX-2792:
---------------------------------------------------------
Commit 1679464 from [~tilman] in branch 'pdfbox/branches/1.8'
[ https://svn.apache.org/r1679464 ]
PDFBOX-2792: remove CRs for equality test
> Text extraction ignores bookmarks
> ---------------------------------
>
> Key: PDFBOX-2792
> URL: https://issues.apache.org/jira/browse/PDFBOX-2792
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.9, 2.0.0
> Reporter: Tilman Hausherr
> Assignee: Tilman Hausherr
> Labels: regression
> Fix For: 1.8.10, 2.0.0
>
>
> As reported by Noam S. on the user mailing list:
> {quote}
> My problem is that when trying to getText(doc) from a certain section of the
> pdf using setStartBookmark(item) and setEndBookmark(item) I get all the text
> rather than just the text from the specified section.
> While trying to resolve this I realized that the writeText(doc, outputStream)
> method always calls the resetEngine() method. That will reset all the
> parameters and delete the bookmarks I set.
> {quote}
> The two lines that reset the bookmarks were added to resetEngine in
> PDFBOX-1808 in [ https://svn.apache.org/r1553175 ] in an attempt to save some
> memory.
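
To make the reported call pattern concrete, here is a minimal sketch. It is a toy model, not PDFBox code: the class `BookmarkResetSketch`, its integer "bookmarks" (section indexes) and the `resetClearsBookmarks` switch are all hypothetical names invented for illustration. It only assumes what the report says, namely that every extraction starts with a reset and that the reset drops the bookmarks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: NOT PDFBox's PDFTextStripper, just the same call pattern.
public class BookmarkResetSketch {
    private final boolean resetClearsBookmarks; // hypothetical switch for the demo
    private Integer startBookmark; // index of the first section to extract, or null
    private Integer endBookmark;   // index of the last section to extract, or null

    public BookmarkResetSketch(boolean resetClearsBookmarks) {
        this.resetClearsBookmarks = resetClearsBookmarks;
    }

    public void setStartBookmark(Integer section) { startBookmark = section; }
    public void setEndBookmark(Integer section) { endBookmark = section; }

    // Clears per-run state; the two assignments model the reported problem.
    private void resetEngine() {
        if (resetClearsBookmarks) {
            startBookmark = null;
            endBookmark = null;
        }
    }

    // Every extraction begins with a reset, as described in the report.
    public List<String> getText(List<String> sections) {
        resetEngine();
        int from = startBookmark == null ? 0 : startBookmark;
        int to = endBookmark == null ? sections.size() - 1 : endBookmark;
        return new ArrayList<>(sections.subList(from, to + 1));
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("intro", "chapter 1", "chapter 2");
        for (boolean clears : new boolean[] {true, false}) {
            BookmarkResetSketch stripper = new BookmarkResetSketch(clears);
            stripper.setStartBookmark(1);
            stripper.setEndBookmark(1);
            System.out.println("reset clears bookmarks: " + clears
                    + " -> " + stripper.getText(doc));
        }
    }
}
```

With the switch on, the caller's bookmarks are gone before the range is computed, so all three sections come back. With it off, only "chapter 1" is returned.
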
> I also found another weird piece of code in the trunk, which would mean that
> text extraction would fail if start and end bookmarks are identical:
> {code}
> if (startPage != null && endPage != null &&
>         startBookmark.getCOSObject() == endBookmark.getCOSObject())
> {
>     // this is a special case where both the start and end bookmark
>     // are the same but point to nothing. In this case
>     // we will not extract any text.
>     startBookmarkPageNumber = 0;
>     endBookmarkPageNumber = 0;
> }
> {code}
> Earlier, that segment was:
> {code}
> if( startBookmarkPageNumber == -1 && startBookmark != null &&
>     endBookmarkPageNumber == -1 && endBookmark != null &&
>     startBookmark.getCOSObject() == endBookmark.getCOSObject() )
> {
>     //this is a special case where both the start and end bookmark
>     //are the same but point to nothing. In this case
>     //we will not extract any text.
>     startBookmarkPageNumber = 0;
>     endBookmarkPageNumber = 0;
> }
> {code}
> which makes more sense. The change was made last year in rev [
> https://svn.apache.org/r1634252 ] as part of the pagetree refactoring.
> I am writing a test to prevent this from breaking in the future.
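
The difference between the two quoted conditions can be checked in isolation. The sketch below is a simplified stand-in, not the PDFBox source: the class `SameBookmarkSketch` and its method names are hypothetical, plain `Object` identity replaces the `getCOSObject()` comparison, a `null` page means the bookmark resolved to nothing, and `-1` is the unresolved page number as in the earlier code.

```java
// Simplified stand-in for the two quoted conditions; not PDFBox code.
public class SameBookmarkSketch {

    // Condition as quoted from the trunk: fires when both pages DID resolve.
    public static boolean trunkSkipsExtraction(Object startBookmark, Object endBookmark,
                                               Object startPage, Object endPage) {
        return startPage != null && endPage != null
                && startBookmark == endBookmark;
    }

    // Condition as quoted from the earlier code: fires when neither page resolved.
    public static boolean earlierSkipsExtraction(Object startBookmark, Object endBookmark,
                                                 int startPageNumber, int endPageNumber) {
        return startPageNumber == -1 && startBookmark != null
                && endPageNumber == -1 && endBookmark != null
                && startBookmark == endBookmark;
    }

    public static void main(String[] args) {
        Object bookmark = new Object(); // same bookmark used as start and end
        Object page = new Object();     // the page it resolves to

        // Case A: identical bookmarks that point to a real page (page 3 here).
        System.out.println("resolved, trunk skips:     "
                + trunkSkipsExtraction(bookmark, bookmark, page, page));
        System.out.println("resolved, earlier skips:   "
                + earlierSkipsExtraction(bookmark, bookmark, 3, 3));

        // Case B: identical bookmarks that point to nothing.
        System.out.println("unresolved, trunk skips:   "
                + trunkSkipsExtraction(bookmark, bookmark, null, null));
        System.out.println("unresolved, earlier skips: "
                + earlierSkipsExtraction(bookmark, bookmark, -1, -1));
    }
}
```

Under these assumptions the trunk condition suppresses extraction exactly when identical bookmarks point to a real page, and lets the "point to nothing" case through, which is the opposite of what its own comment describes. The earlier condition matches the comment.
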
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)