Copy Page from one Document to another: Page Content Stream Linked to Original 
Document
---------------------------------------------------------------------------------------

                 Key: PDFBOX-1093
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1093
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.6.0
            Reporter: [email protected]


When a page is grabbed from one document and added to another (via addPage or 
importPage) the Content Stream of the page retains the scratchFile and 
unFiltered/FilteredStreams from it's original document. This means that a page 
is always connected to it's original document and not wholly a part of it's new 
document.

The problem with this situation:
-When searching for text within a large (800,000 page) pdf file performance can 
potentially be increased if the pdf file is split into single pages for 
incremental text extraction. Each page is searched individually rather than an 
entire document search. To achieve this, a new document is created and a single 
page from the original pdf is added. 

-When searching through these 1 page documents, the scratchFile of the original 
pdf is used, and it will grow as the text from each page is extracted. This 
leads to an out of memory condition, which appears as a "SEVERE Stop reading 
corrupt stream" exception from doDecode() as the write buffer attempts to 
expand to a size greater than the maximum heap size.  


A workaround for this problem is to create a new document, add the page to the 
document, save the document, close it and then load it again. Unfortunately the 
performance cost of this workaround is prohibitive.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to