[jira] [Comment Edited] (PDFBOX-2370) Move caching outside of PDResources

Tilman Hausherr (JIRA) Thu, 17 Sep 2015 12:57:16 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803131#comment-14803131 ]


Tilman Hausherr edited comment on PDFBOX-2370 at 9/17/15 7:55 PM:
------------------------------------------------------------------

I have partially disabled the cache. While searching for "interesting" PDFs in 
the digitalcorpora files, I ran into problems with the attached file 
PDFBOX-2370-002701.pdf: rendering failed with errors saying that certain 
patterns didn't exist.
{code}
Caused by: java.io.IOException: pattern COSName{P3} was not found
        at org.apache.pdfbox.pdmodel.graphics.color.PDPattern.getPattern(PDPattern.java:110)
        at org.apache.pdfbox.rendering.PageDrawer.getPaint(PageDrawer.java:233)
        at org.apache.pdfbox.rendering.PageDrawer.getNonStrokingPaint(PageDrawer.java:542)
        at org.apache.pdfbox.rendering.PageDrawer.fillPath(PageDrawer.java:606)
{code}
The cause is that when a pattern is created for the first time, it also gets a 
reference to the resources of the page it was created on. However, these 
resources change from page to page, so I disabled caching for objects that 
hold a reference to resources.

This sequence on page 5
{code}
/Cs8 cs
0 0 0 /P3 scn
{code}
is a problem because, for Cs8, PageDrawer.getPaint() calls 
PDPattern.getPattern(color), which looks the pattern up in the resources that 
Cs8 has held since its first encounter, on page 3. But the resources of page 3 
contain no pattern P3, which results in the error.
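
To make the failure mode concrete, here is a minimal sketch of the stale 
reference. None of these classes are PDFBox classes: PageResources, 
PatternColorSpace and StaleResourcesDemo are hypothetical stand-ins, and the 
cache is keyed by name only for brevity (an assumption, not how PDFBox keys 
its cache).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the resources of one page (not a PDFBox class).
class PageResources {
    private final Map<String, String> patterns = new HashMap<>();

    void addPattern(String name, String pattern) {
        patterns.put(name, pattern);
    }

    String getPattern(String name) {
        return patterns.get(name);
    }
}

// Hypothetical colorspace that captures the resources it was first created with.
class PatternColorSpace {
    private final PageResources resources; // captured once, never updated

    PatternColorSpace(PageResources resources) {
        this.resources = resources;
    }

    String getPattern(String name) {
        String pattern = resources.getPattern(name);
        if (pattern == null) {
            throw new IllegalStateException("pattern " + name + " was not found");
        }
        return pattern;
    }
}

class StaleResourcesDemo {
    // Cache shared across pages; keyed by name here purely for illustration.
    static final Map<String, PatternColorSpace> CACHE = new HashMap<>();

    static PatternColorSpace getColorSpace(String name, PageResources current) {
        // On a cache hit, 'current' is ignored: the cached object keeps its old resources.
        return CACHE.computeIfAbsent(name, n -> new PatternColorSpace(current));
    }

    public static void main(String[] args) {
        PageResources page3 = new PageResources();
        page3.addPattern("P1", "stripes");
        PageResources page5 = new PageResources();
        page5.addPattern("P3", "dots");

        // First encounter on page 3: Cs8 is created and cached with page 3's resources.
        System.out.println(getColorSpace("Cs8", page3).getPattern("P1"));

        // Page 5: the cached Cs8 still looks in page 3's resources, so P3 is missing.
        try {
            getColorSpace("Cs8", page5).getPattern("P3");
        } catch (IllegalStateException e) {
            System.out.println("lookup failed: " + e.getMessage());
        }
    }
}
```

The second lookup fails even though page 5 does define P3, because the cache 
hit returns the object that was built against page 3.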

This bug didn't show up before because the cache was effectively disabled by 
the bug I fixed 3 days ago in rev 1703017.

Either we can't cache XObjects and Colorspaces, or we should not have them 
keeping a reference to a resources object.
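
The second option could look roughly like the sketch below: the cached object 
holds no resources, and the caller passes in the current page's resources on 
every lookup. ResourceDict, StatelessPatternSpace and SafeCacheDemo are 
hypothetical names for illustration and do not reflect the actual PDFBox API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the resources of one page (not a PDFBox class).
class ResourceDict {
    private final Map<String, String> patterns = new HashMap<>();

    ResourceDict with(String name, String pattern) {
        patterns.put(name, pattern);
        return this;
    }

    String pattern(String name) {
        return patterns.get(name);
    }
}

// Hypothetical colorspace that keeps no reference to any resources,
// so a single instance is safe to share between pages.
class StatelessPatternSpace {
    String getPattern(String name, ResourceDict current) {
        String pattern = current.pattern(name);
        if (pattern == null) {
            throw new IllegalStateException("pattern " + name + " was not found");
        }
        return pattern;
    }
}

class SafeCacheDemo {
    private final Map<String, StatelessPatternSpace> cache = new HashMap<>();

    StatelessPatternSpace colorSpace(String name) {
        return cache.computeIfAbsent(name, n -> new StatelessPatternSpace());
    }

    public static void main(String[] args) {
        ResourceDict page3 = new ResourceDict().with("P1", "stripes");
        ResourceDict page5 = new ResourceDict().with("P3", "dots");
        SafeCacheDemo demo = new SafeCacheDemo();

        // The same cached instance serves both pages; resources are supplied per call.
        System.out.println(demo.colorSpace("Cs8").getPattern("P1", page3));
        System.out.println(demo.colorSpace("Cs8").getPattern("P3", page5));
    }
}
```

The trade-off is that every call site has to have the current resources at 
hand, which changes method signatures along the whole lookup path.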



> Move caching outside of PDResources
> -----------------------------------
>
>                 Key: PDFBOX-2370
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2370
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: PDModel
>    Affects Versions: 2.0.0
>            Reporter: John Hewson
>            Assignee: John Hewson
>            Priority: Blocker
>             Fix For: 2.0.0
>
>         Attachments: PDFBOX-2370-002701.pdf
>
>
> *Note:* This issue is based on a discussion which occurred regarding 
> PDFBOX-2301 but is actually a separate issue.
> Currently we cache the page resources in PDResources, which belongs to a 
> specific PDPage. This causes two problems: 1) users who want to hold many 
> PDPage objects in memory will have high memory use (but this is often by 
> accident*); 2) by caching resources in PDPage we only get to keep that cache 
> for the lifetime of the page, which e.g. in PDFRenderer is a single page 
> only. That means that a font which appears on 40 pages has to be parsed 40 
> times, which causes slow running times, but also memory thrashing as objects 
> are destroyed frequently only to be re-created.
> What PDFRenderer really needs is not page-wide caching but document-wide 
> caching, so that it can cache fonts, cmaps, color profiles, etc. only once. 
> But that won't work for images, because they're too large. What we're 
> beginning to realise is that caching is use-case specific and probably 
> shouldn't be built into PDFBox's pdmodel. Instead we should remove 
> resource caching from PDPage/PDResources and implement custom caching in 
> PDFRenderer and other downstream classes such as PDFTextStripper. I'll 
> happily volunteer myself. The existing high-level PDFBox APIs will continue 
> to "just work" and power users will get a level of control that they 
> appreciate.
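
As a rough illustration of the document-wide caching described above, the 
sketch below parses a font once and reuses it across 40 pages. 
DocumentFontCache and its string-based "parsing" are hypothetical 
placeholders, not PDFBox classes; keying by an object identifier is likewise 
an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical document-wide cache owned by a renderer, not by a page.
class DocumentFontCache {
    private final Map<String, String> parsedFonts = new HashMap<>();
    private int parseCount = 0;

    // Returns the parsed font for the given object id, parsing it at most once.
    String getFont(String objectId) {
        return parsedFonts.computeIfAbsent(objectId, id -> {
            parseCount++;            // stands in for the expensive parsing step
            return "parsed:" + id;
        });
    }

    int parseCount() {
        return parseCount;
    }
}

class DocumentCacheDemo {
    public static void main(String[] args) {
        DocumentFontCache cache = new DocumentFontCache();

        // The cache outlives each page, so the same font is parsed once for all 40 pages.
        for (int page = 0; page < 40; page++) {
            cache.getFont("font-obj-12");
        }
        System.out.println("parses: " + cache.parseCount());
    }
}
```

Because the renderer owns the cache, it can decide per resource type what to 
keep, for example caching fonts but not large images.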
> This strategy could be enhanced by removing memory-hungry methods on 
> PDResources such as getFonts() and getXObjects() which force all resources of 
> a particular type to be loaded, whether or not they are needed, or actually 
> used in the content stream. They would be replaced by methods to retrieve a 
> single resource, e.g. getFont(name).
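
The difference between loading everything and loading one resource by name 
can be sketched as follows. LazyFontResources is a hypothetical class, and 
the parse log exists only to make the cost visible; it is not part of any 
PDFBox API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical resources holder contrasting eager and single-resource access.
class LazyFontResources {
    private final Map<String, String> rawEntries;            // name -> unparsed data
    private final List<String> parseLog = new ArrayList<>(); // names parsed so far

    LazyFontResources(Map<String, String> rawEntries) {
        this.rawEntries = new LinkedHashMap<>(rawEntries);
    }

    // Eager style: parses every font, whether or not the content stream uses it.
    Map<String, String> getFonts() {
        Map<String, String> all = new LinkedHashMap<>();
        for (String name : rawEntries.keySet()) {
            all.put(name, parse(name));
        }
        return all;
    }

    // Single-resource style: parses only the requested font, or returns null if absent.
    String getFont(String name) {
        return rawEntries.containsKey(name) ? parse(name) : null;
    }

    private String parse(String name) {
        parseLog.add(name);
        return "font(" + rawEntries.get(name) + ")";
    }

    List<String> parseLog() {
        return parseLog;
    }
}

class LazyFontDemo {
    public static void main(String[] args) {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("F1", "Helvetica");
        raw.put("F2", "Times");
        raw.put("F3", "Courier");

        LazyFontResources resources = new LazyFontResources(raw);
        resources.getFont("F2");
        System.out.println("parsed after getFont: " + resources.parseLog());
    }
}
```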
> ---
> \* There probably isn't a legitimate use case for 1) any more, we've solved 
> the issues which we used to have with image caching (in fact, the 
> clearCache() method actually no longer needs to be called by PDFRenderer, 
> though it currently is). The real problem is that it's easy to accidentally 
> retain PDPage objects: the PDDocument#getDocumentCatalog().getAllPages() 
> method is dangerous, as looping over it will cause pages to be retained 
> during processing, like so:
> {code}
> for (PDPage page : document.getDocumentCatalog().getAllPages()) // java.util.List
> {
>      // ... this is idiomatic in PDFBox 1.8
> } 
> // List returned by getAllPages() kept in scope until here (bad)
> {code}
> I added a couple of methods a while ago to avoid this by fetching each 
> PDPage one at a time, and this is now used internally in PDFBox to avoid the 
> memory problems we used to have:
> {code}
> for (int i = 0; i < document.getNumberOfPages(); i++)
> {
>     PDPage page = document.getPage(i);
>     // ... this is the new 2.0 way
>     // current page falls out of scope here (good)
> }
> {code}
> To solve this problem, we could change getAllPages() so that instead of 
> returning a List it returns an Iterator<PDPage>, which would provide a nicer 
> API than getPage(int) and most existing code will continue to work. This is 
> also an opportunity to fix type safety issues due to PDPageNode and 
> incorrect handling of the page tree (this is similar to the issue we had 
> recently with the acroform field tree).
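
A minimal sketch of the iterator idea from the issue description, in which 
pages are created one at a time instead of being collected into a List. 
LazyPage and LazyPageTree are hypothetical names, and wrapping the iterator 
in an Iterable (so that existing for-each loops keep compiling) is an 
assumption that goes beyond what the description states.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical page object (not PDFBox's PDPage).
class LazyPage {
    final int index;

    LazyPage(int index) {
        this.index = index;
    }
}

// Hypothetical page tree that creates each page on demand during iteration.
class LazyPageTree implements Iterable<LazyPage> {
    private final int pageCount;
    private int created = 0;

    LazyPageTree(int pageCount) {
        this.pageCount = pageCount;
    }

    int createdSoFar() {
        return created;
    }

    @Override
    public Iterator<LazyPage> iterator() {
        return new Iterator<LazyPage>() {
            private int cursor = 0;

            @Override
            public boolean hasNext() {
                return cursor < pageCount;
            }

            @Override
            public LazyPage next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                created++;
                // The tree keeps no reference to the page it hands out.
                return new LazyPage(cursor++);
            }
        };
    }
}

class LazyPageDemo {
    public static void main(String[] args) {
        LazyPageTree pages = new LazyPageTree(3);
        for (LazyPage page : pages) {
            // Each page becomes unreachable once the loop body finishes with it.
            System.out.println("processing page " + page.index);
        }
    }
}
```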



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
