[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

Dave Smith (JIRA) Wed, 04 Jun 2014 09:15:48 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017814#comment-14017814
 ]


Dave Smith commented on PDFBOX-2101:
------------------------------------


What we do is convert each page of the PDF to an image. Once I have the image, 
I am done with the page. It would be nice if the references that the page was 
holding could then be cleared out of the global cache. If page 2 needs a filter 
that was already loaded for page 1 and has to load it again, so be it. Right 
now we cannot render more than 30 pages without the JVM running out of memory. 
Reloading might be a bit slower, but that is better than not working at all.
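
To make the request concrete, here is a minimal sketch of a cache that drops an 
entry once the last page using it has been released. It uses only the JDK and 
is not PDFBox code: PageScopedCache, get, releasePage and the string keys are 
all hypothetical names chosen for illustration, and how PDFBox's real cache 
would hook into something like this is an assumption.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical sketch (not PDFBox API): cache entries live only as long as a page uses them. */
final class PageScopedCache<K, V> {
    // Cached values by key (e.g. a decoded image or a filter instance).
    private final Map<K, V> values = new HashMap<>();
    // For each key, the page numbers that have requested it and not yet been released.
    private final Map<K, Set<Integer>> users = new HashMap<>();

    /** Returns the cached value for key, loading it on a miss, and records that page uses it. */
    V get(int page, K key, Supplier<V> loader) {
        users.computeIfAbsent(key, k -> new HashSet<>()).add(page);
        return values.computeIfAbsent(key, k -> loader.get());
    }

    /** Drops every entry that no page other than the given one still uses. */
    void releasePage(int page) {
        users.entrySet().removeIf(entry -> {
            entry.getValue().remove(page);
            if (entry.getValue().isEmpty()) {
                values.remove(entry.getKey()); // value becomes eligible for GC
                return true;
            }
            return false;
        });
    }

    int size() {
        return values.size();
    }

    public static void main(String[] args) {
        PageScopedCache<String, byte[]> cache = new PageScopedCache<>();
        for (int page = 1; page <= 3; page++) {
            // Stand-ins for the resources a page would load while rendering.
            cache.get(page, "image-" + page, () -> new byte[1024]);
            cache.get(page, "shared-filter", () -> new byte[16]);
            // ... render the page to an image here ...
            cache.releasePage(page);
            System.out.println("after page " + page + ": " + cache.size() + " cached entries");
        }
    }
}
```

With this shape, a resource shared by pages 1 and 2 is simply loaded again for 
page 2 if page 1 was released first, which trades some speed for a bounded 
heap.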



> Surprising memory consumption when extracting images
> ----------------------------------------------------
>
>                 Key: PDFBOX-2101
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2101
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 1.8.5
>         Environment: Windows 7
> java version "1.7.0_55"
> Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
>            Reporter: Tim Allison
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>         Attachments: 239665.pdf, PDFBOX-2101-298-good.jpg, 
> PDFBOX-2101-714-poor.jpg, java.hprof.zip
>
>
> ExtractImages seems to fail to release memory resources on some files in both 
> PDFBox 1.8.5 and trunk.  
> On this 4MB file 
> [http://digitalcorpora.org/corp/nps/files/govdocs1/239/239665.pdf], if 
> extracting every image on every page (and there are many, many duplicate 
> images), there is an OOM with -Xmx1g.  If -Xmx is not set and more than 2.5g 
> is available, ExtractImages will work.
> With some experimentation, the triggers seem to be JPEG images that have 
> masks.  I'm not sure, though, whether the issue is with PDFBox or Java.
> Commandlines:
> 1.8.5:
> java -Xmx1g -cp pdfbox-app-1.8.5.jar org.apache.pdfbox.ExtractImages 239665.pdf
> 2.0_SNAPSHOT:
> java -Xmx1g -cp pdfbox-app-2.0.0-SNAPSHOT.jar org.apache.pdfbox.tools.ExtractImages -addkey 239665.pdf
> Results:
> 1.8.5: 906 files before OOM
> {noformat}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java:514)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:217)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:363)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:254)
>         at org.apache.pdfbox.ExtractImages.processResources(ExtractImages.java:202)
>         at org.apache.pdfbox.ExtractImages.extractImages(ExtractImages.java:160)
>         at org.apache.pdfbox.ExtractImages.main(ExtractImages.java:65)
> {noformat}
> 2.0_SNAPSHOT: 428 files before OOM
> {noformat}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at org.apache.pdfbox.io.IOUtils.copy(IOUtils.java:70)
>         at org.apache.pdfbox.io.IOUtils.toByteArray(IOUtils.java:52)
>         at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.from8bit(SampledImageReader.java:171)
>         at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:154)
>         at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:171)
>         at org.apache.pdfbox.tools.ExtractImages.write2file(ExtractImages.java:231)
>         at org.apache.pdfbox.tools.ExtractImages.processResources(ExtractImages.java:206)
>         at org.apache.pdfbox.tools.ExtractImages.extractImages(ExtractImages.java:164)
>         at org.apache.pdfbox.tools.ExtractImages.main(ExtractImages.java:69)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
