[jira] [Commented] (PDFBOX-1715) java.lang.OutOfMemoryError when extracting images

Fred Hansen (JIRA) Wed, 11 Sep 2013 13:27:40 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13764718#comment-13764718
 ]


Fred Hansen commented on PDFBOX-1715:
-------------------------------------

This might not be a PDFBox problem. If the program retains references to all
the images, they would take up a lot of space. If each pixel takes 8 bytes,
images about 2k pixels square (roughly 32 MB each) would account for the space
consumption. How big are the images? How many images does the program retain
references to? It may be necessary to write the images to disk one at a time,
making sure that no program variables still point to images that have already
been written.
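
The one-at-a-time approach can be sketched roughly as follows. This is only an
illustration, not code from the reporter's program: PageRenderer is a
hypothetical stand-in for whatever produces the image (for example a call to
PDPage.convertToImage()), and the output file names are made up. The point is
that each image is held only by a loop-local variable, so it becomes
collectable as soon as it has been written.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import javax.imageio.ImageIO;

class PageImageWriter {

    /** Hypothetical stand-in for the real renderer, e.g. PDPage.convertToImage(). */
    interface PageRenderer {
        BufferedImage render(int pageIndex) throws IOException;
    }

    /** Renders and writes one page at a time; returns the number of files written. */
    static int writePages(int pageCount, PageRenderer renderer, File outDir)
            throws IOException {
        int written = 0;
        for (int i = 0; i < pageCount; i++) {
            // Loop-local reference: nothing outside this iteration points at the image.
            BufferedImage image = renderer.render(i);
            File target = new File(outDir, "page-" + (i + 1) + ".png");
            if (!ImageIO.write(image, "png", target)) {
                throw new IOException("No PNG writer available for " + target);
            }
            // Release any reconstructable resources cached by the image.
            image.flush();
            written++;
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        File outDir = Files.createTempDirectory("pages").toFile();
        // Synthetic blank images stand in for rendered PDF pages.
        int n = writePages(3,
                i -> new BufferedImage(64, 64, BufferedImage.TYPE_INT_RGB), outDir);
        System.out.println("Wrote " + n + " images to " + outDir);
    }
}
```

If memory still grows with a loop shaped like this, the retained data is more
likely held somewhere other than the program's own image variables.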

(Or PDFBox could be retaining images or internal work buffers, which would
likewise account for the memory consumption.)
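
Either explanation can be sanity-checked against the numbers in the report.
The sketch below is plain arithmetic, under loud assumptions: "2k" is read as
2048 pixels, 8 bytes per pixel is the guess made above rather than a measured
value, and the per-page figure assumes that nothing is freed between pages and
that the whole heap is consumed by page data.

```java
class HeapEstimate {

    /** Bytes needed for one uncompressed image of the given size. */
    static long imageBytes(int width, int height, int bytesPerPixel) {
        return (long) width * height * bytesPerPixel;
    }

    /** Average heap consumed per page if nothing is released before the failure. */
    static long heapPerPage(long heapBytes, int pagesBeforeFailure) {
        return heapBytes / pagesBeforeFailure;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed image: 2048 x 2048 pixels at 8 bytes per pixel.
        System.out.println(imageBytes(2048, 2048, 8) / mb);   // prints 32
        // Reported: failure after 12 pages with a 512 MB heap.
        System.out.println(heapPerPage(512 * mb, 12) / mb);   // prints 42
        // Reported: failure after 27 pages with a 1024 MB heap.
        System.out.println(heapPerPage(1024 * mb, 27) / mb);  // prints 37
    }
}
```

Roughly 37 to 42 MB retained per page is in the same range as one such image
per page, which is consistent with something holding on to every page's data,
though it does not say whether the program or PDFBox is the holder.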
                
> java.lang.OutOfMemoryError when extracting images
> -------------------------------------------------
>
>                 Key: PDFBOX-1715
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1715
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.8.1
>         Environment: LSB Version:    
> :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
> Distributor ID: CentOS
> Description:    CentOS release 4.7 (Final)
> Release:        4.7
> Codename:       Final
> Java 1.6.0
>            Reporter: sarathy
>
> We are trying to extract images from a PDF file. As part of that, we are
> converting each PDPage into an image using the PDPage.convertToImage method.
> It is a 52-page document.
> During the conversion we see the trace shown further below.
> Here are the steps:
> PDDocument document = PDDocument.load(inputStream);
> List<PDPage> pages = document.getDocumentCatalog().getAllPages();
> for (PDPage pdPage : pages) {
>    if (pdPage.getResources() != null && pdPage.getResources().getImages() != null) {
>      PageInfo page = new PageInfo(pdPage, true, rotation);
>      ...
>    }
> }
> In PageInfo, we are doing:
> BufferedImage bimage = page.convertToImage();
> After processing about 12 pages, it fails with the following error:
> java.lang.OutOfMemoryError: Java heap space
>         at org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:263)
>         at org.apache.pdfbox.io.RandomAccessBuffer.write(RandomAccessBuffer.java:222)
>         at org.apache.pdfbox.io.RandomAccessFileOutputStream.write(RandomAccessFileOutputStream.java:108)
>         at java.io.OutputStream.write(OutputStream.java:75)
>         at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:102)
>         at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:295)
>         at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:237)
>         at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:172)
>         at org.apache.pdfbox.pdmodel.common.PDStream.createInputStream(PDStream.java:231)
>         at org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java:509)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:185)
>         at org.apache.pdfbox.util.operator.pagedrawer.Invoke.process(Invoke.java:83)
>         at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>         at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>         at org.apache.pdfbox.pdfviewer.PageDrawer.drawPage(PageDrawer.java:125)
>         at org.apache.pdfbox.pdmodel.PDPage.convertToImage(PDPage.java:781)
>         at org.apache.pdfbox.pdmodel.PDPage.convertToImage(PDPage.java:712)
>         at oss.rcpt.PageInfo.<init>(PageInfo.java:328)
>         at oss.utl.PDFImageSplitter.execute(PDFImageSplitter.java:217)
>         at oss.utl.PDFUtilities.getImageCount(PDFUtilities.java:165)
>         at cms.utl.PDFImageOperations.main(PDFImageOperations.java:157)
> When we run this from the command line:
> * if we set -Xms512m and -Xmx512m, it fails after 12 pages.
> * if we set -Xms1024m and -Xmx1024m, it fails after 27 pages.
> Separately, we also get a "Colour key masking isn't supported" message for
> each image in the file.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
