[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012326#comment-14012326 ]
Tim Allison edited comment on PDFBOX-2101 at 5/29/14 12:45 PM:
---------------------------------------------------------------

Ah, ok, thank you. That makes sense. To confirm my understanding of [~jeremias.mae...@outline.ch]'s point: is PDFBox caching the uncompressed image? That would explain what I'm seeing. I'm running hprof against trunk with no -Xmx on a Linux box, and ExtractImages has exported 223 images so far (many more to go!). The exported images take up ~17 MB, but Java is choosing to use 1.1 GB of memory.

That would also explain why I was getting 2.6 GB of corrupt images from this file when I was writing directly to the output stream instead of using the correct image utils (thank you, [~tilman], for pointing that out!).

I'll submit the hprof results when that completes, for kicks...

was (Author: talli...@mitre.org):
Ah, ok, thank you. That makes sense. To confirm my understanding of [~jeremias.mae...@outline.ch]'s point: is PDFBox caching the uncompressed image? That would explain what I'm seeing. I'm running hprof against trunk with no -Xmx on a Linux box, and ExtractImages has exported 223 images so far (many more to go!). The exported images take up ~17 MB, but Java is choosing to use 1.1 GB of memory.

I'll submit the hprof results when that completes, for kicks...

> Surprising memory consumption when extracting images
> ----------------------------------------------------
>
>                 Key: PDFBOX-2101
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2101
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 1.8.5
>         Environment: Windows 7
> java version "1.7.0_55"
> Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
>            Reporter: Tim Allison
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>         Attachments: PDFBOX-2101-298-good.jpg, PDFBOX-2101-714-poor.jpg
>
>
> ExtractImages seems to fail to release memory resources on some files in both
> PDFBox 1.8.5 and trunk.
> On this 4 MB file
> [http://digitalcorpora.org/corp/nps/files/govdocs1/239/239665.pdf], if
> extracting every image on every page (and there are many, many duplicate
> images), there is an OOM with -Xmx1g. If there is no Xmx and there is > 2.5g
> available, ExtractImages will work.
> With some experimentation, the triggers seem to be JPEG images that have
> masks. I'm not sure, though, whether the issue is with PDFBox or Java.
> Command lines:
> 1.8.5:
> java -Xmx1g -cp pdfbox-app-1.8.5.jar org.apache.pdfbox.ExtractImages 239665.pdf
> 2.0_SNAPSHOT:
> java -Xmx1g -cp pdfbox-app-2.0.0-SNAPSHOT.jar org.apache.pdfbox.tools.ExtractImages -addkey 239665.pdf
> Results:
> 1.8.5: 906 files before OOM
> {noformat}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java:514)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:217)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:363)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:254)
>         at org.apache.pdfbox.ExtractImages.processResources(ExtractImages.java:202)
>         at org.apache.pdfbox.ExtractImages.extractImages(ExtractImages.java:160)
>         at org.apache.pdfbox.ExtractImages.main(ExtractImages.java:65)
> {noformat}
> 2.0_SNAPSHOT: 428 files before OOM
> {noformat}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at org.apache.pdfbox.io.IOUtils.copy(IOUtils.java:70)
>         at org.apache.pdfbox.io.IOUtils.toByteArray(IOUtils.java:52)
>         at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.from8bit(SampledImageReader.java:171)
>         at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:154)
>         at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:171)
>         at org.apache.pdfbox.tools.ExtractImages.write2file(ExtractImages.java:231)
>         at org.apache.pdfbox.tools.ExtractImages.processResources(ExtractImages.java:206)
>         at org.apache.pdfbox.tools.ExtractImages.extractImages(ExtractImages.java:164)
>         at org.apache.pdfbox.tools.ExtractImages.main(ExtractImages.java:69)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.2#6252)
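
For anyone following along, here is a rough illustration of why a small amount of exported JPEG data can correspond to a much larger heap footprint once images are decoded. It is only a sketch using the plain JDK: the class `DecodedFootprint` and its methods are hypothetical names, not part of PDFBox, and the blank test image is an assumption chosen for convenience. It does not reproduce what ExtractImages actually does; it just compares the pixel storage of a decoded `BufferedImage` with the size of the same image encoded as JPEG.

```java
import java.awt.image.BufferedImage;
import java.awt.image.DataBuffer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of PDFBox: compares decoded vs. encoded size.
public class DecodedFootprint {

    /** Bytes used by the pixel data of a decoded image (object overhead excluded). */
    static long rasterBytes(BufferedImage image) {
        DataBuffer buffer = image.getRaster().getDataBuffer();
        // getDataTypeSize returns the element size in bits (32 for int-backed buffers).
        long bitsPerElement = DataBuffer.getDataTypeSize(buffer.getDataType());
        return (long) buffer.getSize() * buffer.getNumBanks() * bitsPerElement / 8;
    }

    /** Bytes produced when the same image is encoded as JPEG with default settings. */
    static long jpegBytes(BufferedImage image) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (!ImageIO.write(image, "jpg", out)) {
            throw new IOException("no JPEG writer available");
        }
        return out.size();
    }

    public static void main(String[] args) throws IOException {
        // Assumed example dimensions; a blank image compresses unusually well.
        BufferedImage image = new BufferedImage(2000, 1500, BufferedImage.TYPE_INT_RGB);
        System.out.println("decoded raster bytes: " + rasterBytes(image));
        System.out.println("JPEG-encoded bytes:   " + jpegBytes(image));
    }
}
```

The decoded size depends only on the pixel dimensions and the image type, so if decoded copies of many (or duplicate) images are retained, heap use can grow far beyond the size of the files written to disk. Whether that is what happens here is exactly what the hprof results should show.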