[jira] [Updated] (PDFBOX-2792) Text extraction ignores bookmarks

Tilman Hausherr (JIRA) Sun, 10 May 2015 06:32:38 -0700

     [ https://issues.apache.org/jira/browse/PDFBOX-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Tilman Hausherr updated PDFBOX-2792:
------------------------------------
    Description: 
As reported by Noam S. on the user mailing list:
{quote}
My problem is that when trying to getText(doc) from a certain section of the
pdf using setStartBookmark(item) and setEndBookmark(item) I get all the text
rather than just the text from the specified section.

While trying to resolve this I realized that the writeText(doc, outputStream)
method always calls the resetEngine() method. That will reset all the parameters
and delete the bookmarks I set.
{quote}
The two lines that reset the bookmarks were added to resetEngine() for
PDFBOX-1808 (see [ https://svn.apache.org/r1553175 ]) in an attempt to save
some memory.
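
To make the reported failure mode concrete, here is a minimal, self-contained sketch. It is a toy model and not PDFBox code: MiniStripper, its List-of-sections "document" and the resetClearsBookmarks switch are hypothetical names invented for illustration. It only mirrors the pattern from the report, where the extraction entry point calls a reset method that also clears the bookmarks the caller has just set.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: none of these classes exist in PDFBox.
class BookmarkResetSketch {
    static class MiniStripper {
        private final boolean resetClearsBookmarks;
        private String startBookmark;
        private String endBookmark;

        MiniStripper(boolean resetClearsBookmarks) {
            this.resetClearsBookmarks = resetClearsBookmarks;
        }

        void setStartBookmark(String title) { startBookmark = title; }
        void setEndBookmark(String title) { endBookmark = title; }

        // Models the reported behaviour: resetting per-run state also
        // wipes the caller-supplied bookmarks.
        private void resetEngine() {
            if (resetClearsBookmarks) {
                startBookmark = null;
                endBookmark = null;
            }
        }

        // The "document" is a list of section titles; a null bookmark
        // means "no limit" on that side. Bookmarks are assumed to exist
        // in the list.
        String getText(List<String> sections) {
            resetEngine();
            int from = startBookmark == null ? 0 : sections.indexOf(startBookmark);
            int to = endBookmark == null ? sections.size() - 1 : sections.indexOf(endBookmark);
            return String.join(" ", sections.subList(from, to + 1));
        }
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("intro", "methods", "results");
        for (boolean clears : new boolean[] {true, false}) {
            MiniStripper stripper = new MiniStripper(clears);
            stripper.setStartBookmark("methods");
            stripper.setEndBookmark("methods");
            System.out.println(clears + " -> " + stripper.getText(doc));
        }
        // prints:
        // true -> intro methods results
        // false -> methods
    }
}
```

With the reset clearing the bookmarks, the caller gets the whole document back, which matches the symptom in the quote. Keeping user-supplied settings out of the per-run reset is one possible way to avoid this; the sketch does not claim to show the actual fix.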

I also found another weird piece of code in the trunk, which would mean that
no text is extracted if the start and end bookmarks are identical:
{code}
        if (startPage != null && endPage != null &&
            startBookmark.getCOSObject() == endBookmark.getCOSObject())
        {
            // this is a special case where both the start and end bookmark
            // are the same but point to nothing.  In this case
            // we will not extract any text.
            startBookmarkPageNumber = 0;
            endBookmarkPageNumber = 0;
        }
{code}
Earlier, that segment was:
{code}
       if( startBookmarkPageNumber == -1 && startBookmark != null &&
                endBookmarkPageNumber == -1 && endBookmark != null &&
                startBookmark.getCOSObject() == endBookmark.getCOSObject() )
        {
            //this is a special case where both the start and end bookmark
            //are the same but point to nothing.  In this case
            //we will not extract any text.
            startBookmarkPageNumber = 0;
            endBookmarkPageNumber = 0;
        }
{code}
which makes more sense. The change was made last year in rev [ 
https://svn.apache.org/r1634252 ] as part of the pagetree refactoring.
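
The difference between the two conditions can be shown with a small self-contained sketch. It is not PDFBox code: the Bookmark class below is a hypothetical stand-in, and it assumes that a page of null in the trunk version corresponds to a page number of -1 in the earlier version, i.e. a bookmark that points to nothing. The null checks on the bookmarks themselves are left out because the model never passes null bookmarks.

```java
// Toy model only: Bookmark here is not the PDFBox outline item class.
class BookmarkRangeSketch {
    static final class Bookmark {
        final Object cosObject = new Object(); // identity, compared with ==
        final Integer page;                    // null when it points to nothing

        Bookmark(Integer page) {
            this.page = page;
        }
    }

    // Shape of the trunk condition: fires when both pages ARE resolved.
    static boolean trunkSkipsExtraction(Bookmark start, Bookmark end) {
        return start.page != null && end.page != null
                && start.cosObject == end.cosObject;
    }

    // Shape of the earlier condition: fires only when neither page is resolved.
    static boolean earlierSkipsExtraction(Bookmark start, Bookmark end) {
        int startPageNumber = start.page == null ? -1 : start.page;
        int endPageNumber = end.page == null ? -1 : end.page;
        return startPageNumber == -1 && endPageNumber == -1
                && start.cosObject == end.cosObject;
    }

    public static void main(String[] args) {
        Bookmark chapter = new Bookmark(3);     // resolves to a page
        Bookmark dangling = new Bookmark(null); // points to nothing

        System.out.println(trunkSkipsExtraction(chapter, chapter));     // true
        System.out.println(earlierSkipsExtraction(chapter, chapter));   // false
        System.out.println(trunkSkipsExtraction(dangling, dangling));   // false
        System.out.println(earlierSkipsExtraction(dangling, dangling)); // true
    }
}
```

Under that assumption, the trunk condition skips extraction exactly when a single valid bookmark is used as both start and end, and no longer catches the "points to nothing" case that the code comment describes.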

I am writing a test to prevent this from breaking in the future.

  was:
As reported by Noam S. on the user mailing list:
{quote}
My problem is that when trying to getText(doc) from a certain section of the
pdf using setStartBookmark(item) and setEndBookmark(item) I get all the text
rather than just the text from the specified section.

While trying to resolve this I realized that the writeText(doc, outputStream)
method always calls the resetEngine() method. That will reset all the parameters
and delete the bookmarks I set.
{quote}
The two lines that reset the bookmarks were added to resetEngine in PDFBOX-1808 
in [ https://svn.apache.org/r1553175 ] in an attempt to save some memory.


> Text extraction ignores bookmarks
> ---------------------------------
>
>                 Key: PDFBOX-2792
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2792
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.9, 2.0.0
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr



