[
https://issues.apache.org/jira/browse/PDFBOX-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr updated PDFBOX-2792:
------------------------------------
Description:
As reported by Noam S. on the user mailing list:
{quote}
My problem is that when trying to getText(doc) from a certain section of the
pdf using setStartBookmark(item) and setEndBookmark(item) I get all the text
rather than just the text from the specified section.
While trying to resolve this I realized that the writeText(doc, outputStream)
method always calls resetEngine() method. That will reset all the parameters
and delete the bookmarks I set.
{quote}
The two lines that reset the bookmarks were added to resetEngine in PDFBOX-1808
in [ https://svn.apache.org/r1553175 ] in an attempt to save some memory.
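The effect described above can be illustrated with a small stand-alone model. This is only a sketch: BookmarkResetSketch and its methods are hypothetical stand-ins, not PDFBox's actual classes, and strings stand in for outline items. It contrasts a reset that clears the user-set bookmarks (the reported behaviour) with one that leaves them alone.

```java
// Hypothetical model of the reported behaviour; not PDFBox code.
class BookmarkResetSketch {
    // Strings stand in for outline items (assumption for illustration).
    private String startBookmark;
    private String endBookmark;

    void setStartBookmark(String bookmark) { startBookmark = bookmark; }
    void setEndBookmark(String bookmark) { endBookmark = bookmark; }

    // Models the reported problem: the reset also drops the caller's bookmarks.
    private void resetClearingBookmarks() {
        startBookmark = null;
        endBookmark = null;
    }

    // Models a reset that touches only per-run state (nothing to do in this sketch).
    private void resetKeepingBookmarks() {
    }

    // Returns "ALL" when no bookmark range survives the reset,
    // otherwise the range that would be extracted.
    String describeRange(boolean clearBookmarksOnReset) {
        if (clearBookmarksOnReset) {
            resetClearingBookmarks();
        } else {
            resetKeepingBookmarks();
        }
        if (startBookmark == null && endBookmark == null) {
            return "ALL";
        }
        return startBookmark + ".." + endBookmark;
    }
}
```

With bookmarks "ch2" and "ch3" set, describeRange(false) reports the requested range, while describeRange(true) falls back to the whole document, which matches what the reporter observed.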
I also found another odd piece of code in the trunk, which would mean that
text extraction fails if the start and end bookmarks are identical:
{code}
if (startPage != null && endPage != null &&
        startBookmark.getCOSObject() == endBookmark.getCOSObject())
{
    // this is a special case where both the start and end bookmark
    // are the same but point to nothing. In this case
    // we will not extract any text.
    startBookmarkPageNumber = 0;
    endBookmarkPageNumber = 0;
}
{code}
Earlier, that segment was:
{code}
if( startBookmarkPageNumber == -1 && startBookmark != null &&
        endBookmarkPageNumber == -1 && endBookmark != null &&
        startBookmark.getCOSObject() == endBookmark.getCOSObject() )
{
    //this is a special case where both the start and end bookmark
    //are the same but point to nothing. In this case
    //we will not extract any text.
    startBookmarkPageNumber = 0;
    endBookmarkPageNumber = 0;
}
{code}
which makes more sense. The change was made last year in rev [
https://svn.apache.org/r1634252 ] as part of the pagetree refactoring.
I am writing a test to prevent this from breaking in the future.
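To make the difference between the two conditions concrete, here is a minimal sketch that evaluates both as plain predicates. SpecialCaseSketch, Bookmark, trunkCheck and earlierCheck are hypothetical names introduced for illustration, and the reading of -1 as "bookmark did not resolve to a page" is an assumption based on the comment in the snippet.

```java
// Hypothetical comparison of the two conditions; not PDFBox code.
class SpecialCaseSketch {
    // Minimal stand-in for an outline item backed by one COS object.
    static final class Bookmark {
        private final Object cosObject = new Object();
        Object getCOSObject() { return cosObject; }
    }

    // Condition as quoted from the trunk: true when both pages were found
    // and both bookmarks share the same COS object.
    static boolean trunkCheck(Object startPage, Object endPage,
                              Bookmark start, Bookmark end) {
        return startPage != null && endPage != null
                && start.getCOSObject() == end.getCOSObject();
    }

    // Earlier condition: true only when neither bookmark resolved to a page
    // (assumed to be signalled by -1) and both share the same COS object.
    static boolean earlierCheck(int startPageNumber, int endPageNumber,
                                Bookmark start, Bookmark end) {
        return startPageNumber == -1 && start != null
                && endPageNumber == -1 && end != null
                && start.getCOSObject() == end.getCOSObject();
    }
}
```

For one bookmark used as both start and end and pointing at a real page, trunkCheck returns true (so nothing would be extracted) while earlierCheck returns false. For a bookmark that points to nothing, the results are reversed, which is the case the comment actually describes.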
was:
As reported by Noam S. on the user mailing list:
{quote}
My problem is that when trying to getText(doc) from a certain section of the
pdf using setStartBookmark(item) and setEndBookmark(item) I get all the text
rather than just the text from the specified section.
While trying to resolve this I realized that the writeText(doc, outputStream)
method always calls resetEngine() method. That will reset all the parameters
and delete the bookmarks I set.
{quote}
The two lines that reset the bookmarks were added to resetEngine in PDFBOX-1808
in [ https://svn.apache.org/r1553175 ] in an attempt to save some memory.
> Text extraction ignores bookmarks
> ---------------------------------
>
> Key: PDFBOX-2792
> URL: https://issues.apache.org/jira/browse/PDFBOX-2792
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.9, 2.0.0
> Reporter: Tilman Hausherr
> Assignee: Tilman Hausherr