[ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson updated PDFBOX-2370:
--------------------------------
    Component/s: PDModel

> Move caching outside of PDResources
> -----------------------------------
>
>                 Key: PDFBOX-2370
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2370
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: PDModel
>    Affects Versions: 2.0.0
>            Reporter: John Hewson
>             Fix For: 2.0.0
>
>
> *Note:* This issue is based on a discussion which occurred regarding 
> PDFBOX-2301 but is actually a separate issue.
> Currently we cache the page resources in PDResources, which belongs to a 
> specific PDPage. This causes two problems: 1) users who hold many PDPage 
> objects in memory will see high memory use (though this often happens by 
> accident*); 2) by caching resources in PDPage we only keep that cache for 
> the lifetime of the page, which e.g. in PDFRenderer is a single page only. 
> That means that a font which appears on 40 pages has to be parsed 40 times, 
> which causes slow running times and also memory thrashing, as objects are 
> destroyed frequently only to be re-created.
> What PDFRenderer really needs is not page-wide caching but document-wide 
> caching, so that it can cache fonts, cmaps, color profiles, etc. only once. 
> But that won't work for images, because they're too large. What we're 
> beginning to realise is that caching is use-case specific and probably 
> shouldn't be built-in to PDFBox's pdmodel. Instead we should remove 
> resource caching from PDPage/PDResources and implement custom caching in 
> PDFRenderer and other downstream classes such as PDFTextStripper. I'll 
> happily volunteer myself. The existing high-level PDFBox APIs will continue 
> to "just work" and power users will get a level of control that they 
> appreciate.
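
A minimal sketch of the document-wide caching idea described above, using only the JDK. The class name DocumentResourceCache, its methods, and the string stand-ins for fonts are hypothetical and are not part of PDFBox; the assumption is that a downstream class such as a renderer would own one cache per document and would simply not route large resources such as images through it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical document-wide cache, owned by a renderer rather than by a page.
 * K identifies a resource within the document, V is its parsed form.
 */
final class DocumentResourceCache<K, V> {
    private final Map<K, V> parsed = new HashMap<>();
    private int parseCount = 0;

    /** Returns the cached value, invoking the parser only on a cache miss. */
    V get(K key, Function<K, V> parser) {
        V value = parsed.get(key);
        if (value == null) {
            value = parser.apply(key);
            parseCount++;
            parsed.put(key, value);
        }
        return value;
    }

    /** Number of times a parser was actually invoked. */
    int parseCount() {
        return parseCount;
    }

    public static void main(String[] args) {
        DocumentResourceCache<String, String> fonts = new DocumentResourceCache<>();
        // Simulate the same font being requested while rendering 40 pages.
        for (int page = 0; page < 40; page++) {
            fonts.get("F1", name -> "parsed:" + name);
        }
        System.out.println(fonts.parseCount()); // the font was parsed once, not 40 times
    }
}
```

Because the cache outlives any single page, its lifetime and eviction policy become the caller's decision, which is the use-case-specific control the issue argues for.
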
> This strategy could be enhanced by removing memory-hungry methods on 
> PDResources such as getFonts() and getXObjects() which force all resources of 
> a particular type to be loaded, whether or not they are needed, or actually 
> used in the content stream. They would be replaced by methods to retrieve a 
> single resource, e.g. getFont(name).
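
To illustrate the difference between a bulk accessor and a single-resource accessor, here is a rough sketch using only the JDK. LazyResources, register(), and the load counter are hypothetical stand-ins and do not reflect the actual PDResources API; strings stand in for parsed fonts, and no caching is done here since the proposal moves caching elsewhere.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical stand-in for a resource dictionary; not the PDResources API. */
final class LazyResources {
    private final Map<String, Supplier<String>> loaders = new LinkedHashMap<>();
    private int loads = 0;

    /** Registers an unparsed resource under the given name. */
    void register(String name, Supplier<String> loader) {
        loaders.put(name, loader);
    }

    /** Bulk accessor: forces every registered resource to be loaded. */
    Map<String, String> getFonts() {
        Map<String, String> all = new LinkedHashMap<>();
        for (Map.Entry<String, Supplier<String>> e : loaders.entrySet()) {
            all.put(e.getKey(), load(e.getValue()));
        }
        return all;
    }

    /** Single accessor: loads only the named resource, or returns null if absent. */
    String getFont(String name) {
        Supplier<String> loader = loaders.get(name);
        return loader == null ? null : load(loader);
    }

    private String load(Supplier<String> loader) {
        loads++;
        return loader.get();
    }

    /** Number of loads performed so far. */
    int loads() {
        return loads;
    }
}
```

With three fonts registered, a call to getFont("F2") performs one load, whereas getFonts() performs three regardless of which fonts the content stream actually uses.
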
> ---
> \* There probably isn't a legitimate use case for 1) any more, as we've 
> solved the issues which we used to have with image caching (in fact, the 
> clearCache() method no longer needs to be called by PDFRenderer, though it 
> currently is). The real problem is that it's easy to accidentally retain 
> PDPage objects: the PDDocument#getDocumentCatalog().getAllPages() method is 
> dangerous because looping over it causes pages to be retained during 
> processing, like so:
> {code}
> for (PDPage page : document.getDocumentCatalog().getAllPages()) // java.util.List
> {
>      // ... this is idiomatic in PDFBox 1.8
> } 
> // List returned by getAllPages() kept in scope until here (bad)
> {code}
> I added a couple of methods a while ago to avoid this by fetching each 
> PDPage one at a time, and this is now used internally in PDFBox to avoid the 
> memory problems we used to have:
> {code}
> for (int i = 0; i < document.getNumberOfPages(); i++)
> {
>     PDPage page = document.getPage(i);
>     // ... this is the new 2.0 way
>     // current page falls out of scope here (good)
> }
> {code}
> To solve this problem, we could change getAllPages() so that instead of 
> returning a List it returns an Iterator<PDPage>, which would provide a nicer 
> API than getPage(int) and most existing code will continue to work. This is 
> also an opportunity to fix type safety issues due to PDPageNode and 
> incorrect handling of the page tree (this is similar to the issue we had 
> recently with the acroform field tree).
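
One way the lazy page iteration proposed above might look, sketched with the JDK only. LazyPages and its constructor arguments are hypothetical, and P stands in for PDPage. Note that Java's for-each syntax requires an Iterable rather than a bare Iterator, so this sketch assumes the method would return an Iterable whose iterator fetches each page on demand.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.IntFunction;

/** Hypothetical lazy page sequence; P stands in for PDPage. */
final class LazyPages<P> implements Iterable<P> {
    private final int count;
    private final IntFunction<P> fetch;

    /**
     * @param count number of pages in the document
     * @param fetch looks up a single page by zero-based index
     */
    LazyPages(int count, IntFunction<P> fetch) {
        this.count = count;
        this.fetch = fetch;
    }

    @Override
    public Iterator<P> iterator() {
        return new Iterator<P>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < count;
            }

            @Override
            public P next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                // Fetch one page at a time; no list of pages is ever built.
                return fetch.apply(next++);
            }
        };
    }
}
```

A caller could then write a for-each loop over the pages as before, while each page becomes unreachable once the loop body moves on, provided the caller does not store it.
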



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
