Viraf Bankwalla created PDFBOX-3700:
---------------------------------------

             Summary: OutOfMemoryException converting PDF to TIFF Images
                 Key: PDFBOX-3700
                 URL: https://issues.apache.org/jira/browse/PDFBOX-3700
             Project: PDFBox
          Issue Type: Bug
          Components: Rendering
    Affects Versions: 2.0.4
            Reporter: Viraf Bankwalla


I am using PDFBox to convert PDF documents to a series of TIFF images (one for 
each page).  The implementation uses PDFRenderer to render each page.  This 
works fine when I process a single document in a single thread.  However, 
when I process multiple documents (each in its own thread), I get an 
OutOfMemoryError.

Analyzing the heap dump, I see that this is caused by the images cached in 
DefaultResourceCache.  Objects are added to the cache in PDResources, which 
includes a method private boolean isAllowedCache(PDXObject xobject) that 
determines whether a PDXObject can be cached.  I have extended this method to 
reject XObjects whose subtype is COSName.IMAGE, and am now able to process 
multiple documents in parallel.

A proposed fix is to add images to the set of objects that are never cached.  
For example, the following could be added to PDResources.isAllowedCache:



{code:title=PDResources.java|borderStyle=solid}
// Do not cache image XObjects.
COSBase image = xobject.getCOSObject().getDictionaryObject(COSName.SUBTYPE);
if (image instanceof COSName && ((COSName) image).equals(COSName.IMAGE))
{
    return false;
}
{code}
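
To illustrate the effect of the check without the PDFBox classes, here is a minimal self-contained sketch.  The names FilteringCacheSketch and Resource are hypothetical stand-ins, not PDFBox API; only the rule itself (never cache an entry whose subtype is Image) comes from the proposal above.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the proposed rule. All names here are hypothetical
// stand-ins for illustration; this is not PDFBox code.
public class FilteringCacheSketch
{
    // Stand-in for an XObject: a subtype name plus a (possibly large) payload.
    public static final class Resource
    {
        final String subtype;
        final byte[] payload;

        public Resource(String subtype, int payloadSize)
        {
            this.subtype = subtype;
            this.payload = new byte[payloadSize];
        }
    }

    private final Map<String, Resource> cache = new HashMap<>();

    // Mirrors the proposed check: image resources are never cached.
    public static boolean isAllowedCache(Resource resource)
    {
        return !"Image".equals(resource.subtype);
    }

    // Stores the resource only if the filter allows it.
    public void put(String key, Resource resource)
    {
        if (isAllowedCache(resource))
        {
            cache.put(key, resource);
        }
    }

    // Returns the cached resource, or null if it was never cached.
    public Resource get(String key)
    {
        return cache.get(key);
    }

    // Number of entries currently held by the cache.
    public int size()
    {
        return cache.size();
    }

    public static void main(String[] args)
    {
        FilteringCacheSketch cache = new FilteringCacheSketch();
        cache.put("Im0", new Resource("Image", 1024 * 1024));
        cache.put("Fm0", new Resource("Form", 128));
        System.out.println("image cached: " + (cache.get("Im0") != null)); // false
        System.out.println("form cached: " + (cache.get("Fm0") != null));  // true
    }
}
```

The likely cost of this choice is that an image used on several pages would be decoded again each time it is drawn, trading CPU time for a smaller heap.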

A possible patch is enclosed below.  I would like to get a fix in for the next 
release.

diff --git a/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java b/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java
index 6e1e464..aa94122 100644
--- a/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java
+++ b/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java
@@ -31,15 +31,15 @@
 import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDPropertyList;
 import org.apache.pdfbox.pdmodel.font.PDFont;
 import org.apache.pdfbox.pdmodel.font.PDFontFactory;
+import org.apache.pdfbox.pdmodel.graphics.PDXObject;
+import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
 import org.apache.pdfbox.pdmodel.graphics.color.PDPattern;
 import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
+import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
 import org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentGroup;
-import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
-import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
 import org.apache.pdfbox.pdmodel.graphics.pattern.PDAbstractPattern;
 import org.apache.pdfbox.pdmodel.graphics.shading.PDShading;
-import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
-import org.apache.pdfbox.pdmodel.graphics.PDXObject;
+import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
 
 /**
  * A set of resources available at the page/pages/stream level.
@@ -445,6 +445,12 @@
                     return false;
                 }
             }
+            
+            COSBase image = xobject.getCOSObject().getDictionaryObject(COSName.SUBTYPE);
+            if (image instanceof COSName && ((COSName) image).equals(COSName.IMAGE))
+            {
+               return false;
+            }
         }
         return true;
     }





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
