[ 
https://issues.apache.org/jira/browse/PDFBOX-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler resolved PDFBOX-1093.
----------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.8.2
         Assignee: Andreas Lehmkühler

Since PDFBOX-1586 was resolved, the scratch file of the source PDF is no 
longer used. Setting this issue to resolved.
                
> Copy Page from one Document to another: Page Content Stream Linked to 
> Original Document
> ---------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1093
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1093
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.6.0
>            Reporter: [email protected]
>            Assignee: Andreas Lehmkühler
>             Fix For: 1.8.2
>
>
> When a page is taken from one document and added to another (via addPage or 
> importPage), the content stream of the page retains the scratchFile and 
> unFiltered/FilteredStreams from its original document. This means that a 
> page is always connected to its original document and is not wholly a part 
> of its new document.
> The problem with this situation:
> -When searching for text within a large (800,000 page) PDF file, performance 
> can potentially be improved by splitting the file into single pages for 
> incremental text extraction, so that each page is searched individually 
> instead of searching the entire document at once. To achieve this, a new 
> document is created and a single page from the original PDF is added to it. 
> -When searching through these one-page documents, the scratchFile of the 
> original PDF is used, and it grows as the text from each page is 
> extracted. This leads to an out-of-memory condition, which appears as a 
> "SEVERE Stop reading corrupt stream" exception from doDecode() as the write 
> buffer attempts to expand beyond the maximum heap size.  
> A workaround for this problem is to create a new document, add the page to 
> it, save the document, close it, and then load it again. 
> Unfortunately, the performance cost of this workaround is prohibitive.
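
The linkage described in the report can be illustrated with a small 
plain-Java model. This is only a sketch: Scratch, Page, Document, 
importShallow and importCopy are hypothetical names invented for the 
illustration and are not PDFBox classes or methods. It assumes that text 
extraction writes decoded bytes into whichever scratch buffer the page 
references, which is the behavior the reporter describes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the reported problem; none of these types are PDFBox APIs.
public class ScratchLinkModel {

    /** Stand-in for a per-document scratch buffer that only ever grows. */
    static final class Scratch {
        private long bytesUsed;
        void write(int n) { bytesUsed += n; }
        long size() { return bytesUsed; }
    }

    /** Stand-in for a page whose content stream decodes into a scratch buffer. */
    static final class Page {
        final String content;
        final Scratch scratch;
        Page(String content, Scratch scratch) {
            this.content = content;
            this.scratch = scratch;
        }
        /** Extraction "decodes" the content into the scratch buffer the page references. */
        String extractText() {
            scratch.write(content.length());
            return content;
        }
    }

    static final class Document {
        final Scratch scratch = new Scratch();
        final List<Page> pages = new ArrayList<>();

        Page createPage(String content) {
            Page page = new Page(content, scratch);
            pages.add(page);
            return page;
        }
        /** Shallow import: the page keeps pointing at its source document's scratch. */
        Page importShallow(Page page) {
            pages.add(page);
            return page;
        }
        /** Copying import: the content is re-homed onto this document's own scratch. */
        Page importCopy(Page page) {
            return createPage(page.content);
        }
    }

    /** Splits a 3-page source into one-page documents, extracts each, and reports source scratch use. */
    static long sourceScratchAfterSearch(boolean copyOnImport) {
        Document source = new Document();
        for (int i = 0; i < 3; i++) {
            source.createPage("page-" + i);
        }
        for (Page page : source.pages) {
            Document single = new Document();
            Page imported = copyOnImport ? single.importCopy(page) : single.importShallow(page);
            imported.extractText();
        }
        return source.scratch.size();
    }

    public static void main(String[] args) {
        System.out.println("shallow import: " + sourceScratchAfterSearch(false));
        System.out.println("copying import: " + sourceScratchAfterSearch(true));
    }
}
```

In this model the shallow import makes the source document's scratch usage 
grow with every page searched, while the copying import leaves it untouched. 
The save, close and reload workaround from the report has the same effect as 
the copying import here, at the cost of a full serialization round trip.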

