[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012298#comment-14012298 ]
Tim Allison commented on PDFBOX-2101:
-------------------------------------

Thank you, all, for your work on this! I can't speak for the entire Tika community, but I suspect that the most common use case would be to extract one of each image (whether or not the image appears on 20 pages). A caching parameter would be very handy for this. Those who want to extract 20 copies of the same image can choose to take the potential memory hit for the sake of speed.

We have a decent method to configure PDFBox on Tika, and it would be great to add this if it isn't too much effort.

Thank you, again.

> Surprising memory consumption when extracting images
> ----------------------------------------------------
>
>                 Key: PDFBOX-2101
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2101
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 1.8.5
>         Environment: Windows 7
> java version "1.7.0_55"
> Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
>            Reporter: Tim Allison
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>         Attachments: PDFBOX-2101-298-good.jpg, PDFBOX-2101-714-poor.jpg
>
>
> ExtractImages seems to fail to release memory resources on some files in both
> PDFBox 1.8.5 and trunk.
> On this 4MB file
> [http://digitalcorpora.org/corp/nps/files/govdocs1/239/239665.pdf], when
> extracting every image on every page (and there are many, many duplicate
> images), there is an OOM with -Xmx1g. If there is no Xmx and there is > 2.5g
> available, ExtractImages will work.
> With some experimentation, the triggers seem to be JPEG images that have
> masks. I'm not sure, though, whether the issue is with PDFBox or Java.
> Commandlines:
> 1.8.5:
> java -Xmx1g -cp pdfbox-app-1.8.5.jar org.apache.pdfbox.ExtractImages 239665.pdf
> 2.0_SNAPSHOT:
> java -Xmx1g -cp pdfbox-app-2.0.0-SNAPSHOT.jar org.apache.pdfbox.tools.ExtractImages -addkey 239665.pdf
> Results:
> 1.8.5: 906 files before OOM
> {noformat}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java:514)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:217)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:363)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:254)
>         at org.apache.pdfbox.ExtractImages.processResources(ExtractImages.java:202)
>         at org.apache.pdfbox.ExtractImages.extractImages(ExtractImages.java:160)
>         at org.apache.pdfbox.ExtractImages.main(ExtractImages.java:65)
> {noformat}
> 2.0_SNAPSHOT: 428 files before OOM
> {noformat}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at org.apache.pdfbox.io.IOUtils.copy(IOUtils.java:70)
>         at org.apache.pdfbox.io.IOUtils.toByteArray(IOUtils.java:52)
>         at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.from8bit(SampledImageReader.java:171)
>         at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:154)
>         at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:171)
>         at org.apache.pdfbox.tools.ExtractImages.write2file(ExtractImages.java:231)
>         at org.apache.pdfbox.tools.ExtractImages.processResources(ExtractImages.java:206)
>         at org.apache.pdfbox.tools.ExtractImages.extractImages(ExtractImages.java:164)
>         at org.apache.pdfbox.tools.ExtractImages.main(ExtractImages.java:69)
> {noformat}
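
To make the "extract one of each image" use case from the comment concrete, here is a minimal sketch of how an extractor might skip duplicates. It is not PDFBox or Tika code: the class `SeenImages` and its methods are hypothetical names invented for illustration, and it assumes the caller already has each image's encoded bytes in hand. It keeps only a SHA-256 digest per image, so the bookkeeping stays small no matter how many pages reuse the same image.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper (not part of PDFBox or Tika): remembers which image payloads were already exported. */
final class SeenImages {
    // Only hex-encoded SHA-256 digests are retained, never the image data itself.
    private final Set<String> digests = new HashSet<>();

    /** Returns true the first time a given byte sequence is offered, false on every repeat. */
    boolean firstTime(byte[] imageBytes) {
        return digests.add(sha256Hex(imageBytes));
    }

    /** Number of distinct images offered so far. */
    int uniqueCount() {
        return digests.size();
    }

    private static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

An extraction loop would call `firstTime(bytes)` before writing each file and skip the write when it returns false. Hashing still requires reading each image's bytes once; a real implementation might instead key on the identity of the underlying PDF object and avoid even that, but that is an assumption about the library's internals rather than something stated in this issue.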