[ https://issues.apache.org/jira/browse/PDFBOX-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler resolved PDFBOX-1093.
----------------------------------------
Resolution: Fixed
Fix Version/s: 1.8.2
Assignee: Andreas Lehmkühler
The scratch file of the source PDF is no longer used now that PDFBOX-1586
has been solved. Setting this issue to resolved.
> Copy Page from one Document to another: Page Content Stream Linked to
> Original Document
> ---------------------------------------------------------------------------------------
>
> Key: PDFBOX-1093
> URL: https://issues.apache.org/jira/browse/PDFBOX-1093
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.6.0
> Reporter: [email protected]
> Assignee: Andreas Lehmkühler
> Fix For: 1.8.2
>
>
> When a page is grabbed from one document and added to another (via addPage or
> importPage), the Content Stream of the page retains the scratchFile and
> unFiltered/FilteredStreams from its original document. This means that a
> page is always connected to its original document and not wholly a part of
> its new document.
> The problem with this situation:
> - When searching for text within a large (800,000 page) PDF file, performance
> can potentially be increased if the PDF file is split into single pages for
> incremental text extraction. Each page is searched individually rather than
> searching the entire document. To achieve this, a new document is created and a
> single page from the original PDF is added.
> - When searching through these one-page documents, the scratchFile of the
> original PDF is used, and it grows as the text from each page is
> extracted. This leads to an out-of-memory condition, which appears as a
> "SEVERE Stop reading corrupt stream" exception from doDecode() as the write
> buffer attempts to expand to a size greater than the maximum heap size.
> A workaround for this problem is to create a new document, add the page to
> the document, save the document, close it and then load it again.
> Unfortunately the performance cost of this workaround is prohibitive.
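
The mechanism described in the report can be illustrated with a small toy model. This is only a sketch under stated assumptions: Scratch, Page, Doc and ScratchDemo are hypothetical stand-ins written for illustration, not PDFBox classes, and the model ignores filtering, parsing and real file I/O. It shows how a page that keeps a reference to its original document's scratch buffer makes that buffer grow on every extraction, while a page whose content is copied into the new document (which is roughly what the save/close/reload workaround achieves) leaves the source buffer untouched.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a document-wide scratch buffer; not a PDFBox class.
final class Scratch {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    void write(byte[] data) {
        buffer.write(data, 0, data.length);
    }

    int size() {
        return buffer.size();
    }
}

// Hypothetical page: remembers which scratch buffer its decoded content goes to.
final class Page {
    final byte[] content;
    final Scratch scratch;

    Page(byte[] content, Scratch scratch) {
        this.content = content;
        this.scratch = scratch;
    }

    // Models text extraction: decoding appends the page content to the scratch buffer.
    void extract() {
        scratch.write(content);
    }
}

// Hypothetical document that owns one scratch buffer and a list of pages.
final class Doc {
    final Scratch scratch = new Scratch();
    final List<Page> pages = new ArrayList<>();

    Page newPage(byte[] content) {
        Page page = new Page(content, scratch);
        pages.add(page);
        return page;
    }

    // Models the reported behaviour: the page still points at its old scratch buffer.
    Page addShared(Page page) {
        pages.add(page);
        return page;
    }

    // Models the effect of the workaround: the content now belongs to this document.
    Page addCopied(Page page) {
        return newPage(page.content.clone());
    }
}

final class ScratchDemo {
    // Splits a source document into one-page documents, extracts each page,
    // and returns how many bytes ended up in the SOURCE document's scratch buffer.
    static int sourceScratchAfterSplit(int pageCount, int pageBytes, boolean copy) {
        Doc source = new Doc();
        for (int i = 0; i < pageCount; i++) {
            source.newPage(new byte[pageBytes]);
        }
        for (Page page : source.pages) {
            Doc single = new Doc();
            Page added = copy ? single.addCopied(page) : single.addShared(page);
            added.extract();
        }
        return source.scratch.size();
    }

    public static void main(String[] args) {
        System.out.println("shared: " + sourceScratchAfterSplit(1000, 1024, false));
        System.out.println("copied: " + sourceScratchAfterSplit(1000, 1024, true));
    }
}
```

In this model the shared variant accumulates pageCount * pageBytes bytes in the source buffer, while the copied variant leaves it empty. Each copied page's data lives in its own short-lived document and can be released once that document is discarded.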