[ 
https://issues.apache.org/jira/browse/PDFBOX-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120538#comment-14120538
 ] 

John Hewson edited comment on PDFBOX-2310 at 9/3/14 9:55 PM:
-------------------------------------------------------------

{code}
I get 107 matches in 24 files. 
{code}

Only class member variables and any code which loops over page resources are 
relevant; I only looked for private fields.

However, as you've spotted, even short-term retention of the PDImageXObject 
cache is a problem, and the file from PDFBOX-2101 is now having issues with 
memory usage. This is due to a number of large images on a single page: 
because PDResources retains the PDImageXObject instances during the loop 
over the page's resources, we end up accumulating cached images.

However, something's not right here: PDFToImage can render the document without 
any memory issues, even though it's not calling PDImageXObject#clear() and it 
loops over the PDResources in exactly the same manner. There's something 
specific about ExtractImages which is causing it to use more memory.

As the author of the PDFormXObject#getImage() method, I'm beginning to wonder if 
it should simply not cache images, as they're just so large. Downstream callers 
such as PageDrawer could have their own, much smarter caching policies, such as 
LRU or some scheme which takes memory pressure into account, such as a 
SoftReference.
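
To make the two policies mentioned above concrete, here is a minimal sketch in 
plain Java. It is not PDFBox code: LruCache and SoftCache are hypothetical 
names, and how a caller such as PageDrawer would key or load its images is an 
assumption left to the loader function.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of PDFBox: a bounded LRU cache which evicts
// the least recently used entry once maxEntries is exceeded.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true: get() moves an entry to the end
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // called by put(); true drops the eldest entry
    }
}

// Hypothetical helper, not part of PDFBox: values are only softly reachable,
// so the garbage collector may reclaim them under memory pressure. A cleared
// value is simply recomputed by the loader on the next lookup.
class SoftCache<K, V> {
    private final Map<K, SoftReference<V>> map = new HashMap<>();

    V get(K key, Function<K, V> loader) {
        SoftReference<V> ref = map.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value == null) {
            value = loader.apply(key); // e.g. decode the image again
            map.put(key, new SoftReference<>(value));
        }
        return value;
    }
}
```

With a policy like this living in the caller, the image object itself would 
not need to hold on to its decoded pixels; whether that suits PageDrawer is 
still an open question.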

Either way, we should try to figure out what's causing ExtractImages to 
consume more memory than PDFToImage.


> codeToGID NPE
> -------------
>
>                 Key: PDFBOX-2310
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2310
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>    Affects Versions: 2.0.0
>            Reporter: simon steiner
>            Assignee: John Hewson
>             Fix For: 2.0.0
>
>         Attachments: expected.pdf
>
>
> java -jar ~/pdf-box-svn/app/target/pdfbox-app-2.0.0-SNAPSHOT.jar PDFToImage expected.pdf
> Exception in thread "main" java.lang.NullPointerException
>       at org.apache.pdfbox.pdmodel.font.PDType0Font.codeToGID(PDType0Font.java:306)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
