[
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr updated PDFBOX-2370:
------------------------------------
Attachment: PDFBOX-2370-002701.pdf
> Move caching outside of PDResources
> -----------------------------------
>
> Key: PDFBOX-2370
> URL: https://issues.apache.org/jira/browse/PDFBOX-2370
> Project: PDFBox
> Issue Type: Improvement
> Components: PDModel
> Affects Versions: 2.0.0
> Reporter: John Hewson
> Assignee: John Hewson
> Priority: Blocker
> Fix For: 2.0.0
>
> Attachments: PDFBOX-2370-002701.pdf
>
>
> *Note:* This issue is based on a discussion which occurred regarding
> PDFBOX-2301 but is actually a separate issue.
> Currently we cache the page resources in PDResources, which belongs to a
> specific PDPage. This causes two problems: 1) users who hold many PDPage
> objects in memory will see high memory use (though this is often by
> accident*); 2) by caching resources in PDPage we only get to keep that cache
> for the lifetime of the page, which, e.g. in PDFRenderer, is a single page
> only. That means that a font which appears on 40 pages has to be parsed 40
> times, which causes slow running times and also memory thrashing, as objects
> are destroyed frequently only to be re-created.
> What PDFRenderer really needs is not page-wide caching but document-wide
> caching, so that it can cache fonts, cmaps, color profiles, etc. only once.
> But that won't work for images, because they're too large. What we're
> beginning to realise is that caching is use-case specific and probably
> shouldn't be built into PDFBox's pdmodel. Instead we should remove
> resource caching from PDPage/PDResources and implement custom caching in
> PDFRenderer and other downstream classes such as PDFTextStripper. I'll
> happily volunteer myself. The existing high-level PDFBox APIs will continue
> to "just work" and power users will get a level of control that they
> appreciate.
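
To make the document-wide caching idea above concrete, here is a minimal sketch. It does not use any PDFBox classes: DocumentResourceCache, its loader function and getLoadCount() are hypothetical names invented for illustration, and the "parsing" is a stand-in string operation. It assumes Java 8 or later and a single-threaded caller.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of PDFBox: caches parsed resources
// for the lifetime of one document rather than one page.
final class DocumentResourceCache<K, V>
{
    private final Map<K, V> cache = new HashMap<K, V>();
    private final Function<K, V> loader;
    private int loadCount = 0;

    DocumentResourceCache(Function<K, V> loader)
    {
        this.loader = loader;
    }

    // Returns the cached value, invoking the loader only on the
    // first request for a given key. Not thread-safe.
    V get(K key)
    {
        V value = cache.get(key);
        if (value == null)
        {
            value = loader.apply(key);
            loadCount++;
            cache.put(key, value);
        }
        return value;
    }

    // Number of times the (expensive) loader actually ran.
    int getLoadCount()
    {
        return loadCount;
    }
}

final class DocumentResourceCacheDemo
{
    public static void main(String[] args)
    {
        // The lambda stands in for parsing a font; it is an assumption.
        DocumentResourceCache<String, String> fonts =
                new DocumentResourceCache<String, String>(name -> "parsed:" + name);
        for (int pageIndex = 0; pageIndex < 40; pageIndex++)
        {
            fonts.get("F1"); // the same font is referenced on every page
        }
        System.out.println(fonts.getLoadCount()); // prints 1
    }
}
```

A cache like this would suit small, frequently reused resources such as fonts. Images would presumably need a different policy, or no cache at all, for the size reason given above.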
> This strategy could be enhanced by removing memory-hungry methods on
> PDResources such as getFonts() and getXObjects() which force all resources of
> a particular type to be loaded, whether or not they are needed, or actually
> used in the content stream. They would be replaced by methods to retrieve a
> single resource, e.g. getFont(name).
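
The difference between loading every resource of a type and retrieving a single one can be sketched as follows. This is only an illustration under stated assumptions: LazyFontResources is a hypothetical stand-in, not the PDResources API, and parse() is a placeholder for the real, expensive work.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a resource dictionary; not the PDResources API.
final class LazyFontResources
{
    private final Map<String, String> rawEntries; // resource name -> unparsed data
    private int parseCount = 0;

    LazyFontResources(Map<String, String> rawEntries)
    {
        this.rawEntries = rawEntries;
    }

    // Eager variant: parses every entry, whether or not it is used.
    Map<String, String> getFonts()
    {
        Map<String, String> parsed = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String> entry : rawEntries.entrySet())
        {
            parsed.put(entry.getKey(), parse(entry.getValue()));
        }
        return parsed;
    }

    // Single-resource variant: parses only the requested entry.
    // Returns null if no entry with that name exists.
    String getFont(String name)
    {
        String raw = rawEntries.get(name);
        return raw == null ? null : parse(raw);
    }

    // Placeholder for expensive parsing; counts how often it runs.
    private String parse(String raw)
    {
        parseCount++;
        return "parsed(" + raw + ")";
    }

    int getParseCount()
    {
        return parseCount;
    }
}
```

With three entries in the dictionary, one call to getFont("F2") runs the parser once, whereas one call to getFonts() runs it three times.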
> ---
> \* There probably isn't a legitimate use case for 1) any more: we've solved
> the issues which we used to have with image caching (in fact, the
> clearCache() method no longer needs to be called by PDFRenderer,
> though it currently is). The real problem is that it's easy to accidentally
> retain PDPage objects. The PDDocument#getDocumentCatalog().getAllPages()
> method is dangerous because looping over it causes pages to be retained
> during processing, like so:
> {code}
> for (PDPage page : document.getDocumentCatalog().getAllPages()) // java.util.List
> {
>     // ... this is idiomatic in PDFBox 1.8
> }
> // List returned by getAllPages() kept in scope until here (bad)
> {code}
> I added a couple of methods a while ago to avoid this by fetching each
> PDPage one at a time, and this is now used internally in PDFBox to avoid the
> memory problems we used to have:
> {code}
> for (int i = 0; i < document.getNumberOfPages(); i++)
> {
>     PDPage page = document.getPage(i);
>     // ... this is the new 2.0 way
>     // current page falls out of scope here (good)
> }
> {code}
> To solve this problem, we could change getAllPages() so that instead of
> returning a List it returns an Iterator<PDPage>, which would provide a nicer
> API than getPage(int) and most existing code will continue to work. This is
> also an opportunity to fix type safety issues due to PDPageNode and
> incorrect handling of the page tree (this is similar to the issue we had
> recently with the acroform field tree).
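
The iterator proposal above could look roughly like the sketch below. It rests on assumptions: LazyPages and its pageLoader are hypothetical names, no PDFBox types are used, and the type implements Iterable (not just Iterator) so that existing for-each loops would keep compiling. Each page is fetched only when next() is called, so no list of all pages is held.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.IntFunction;

// Hypothetical: yields pages one at a time by index instead of
// materialising a List of every page up front.
final class LazyPages<P> implements Iterable<P>
{
    private final int count;
    private final IntFunction<P> pageLoader; // stand-in for a getPage(int) call

    LazyPages(int count, IntFunction<P> pageLoader)
    {
        this.count = count;
        this.pageLoader = pageLoader;
    }

    @Override
    public Iterator<P> iterator()
    {
        return new Iterator<P>()
        {
            private int index = 0;

            @Override
            public boolean hasNext()
            {
                return index < count;
            }

            @Override
            public P next()
            {
                if (!hasNext())
                {
                    throw new NoSuchElementException();
                }
                // The page is loaded here, on demand; the iterator
                // keeps no reference to it afterwards.
                return pageLoader.apply(index++);
            }
        };
    }
}
```

A caller could then write for (String page : new LazyPages<String>(3, i -> "page" + i)) { ... } and each element would go out of scope at the end of its loop iteration.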
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)