Lior Yaffe created PDFBOX-4739:
----------------------------------

             Summary: Memory issues when rendering pdf to image
                 Key: PDFBOX-4739
                 URL: https://issues.apache.org/jira/browse/PDFBOX-4739
             Project: PDFBox
          Issue Type: Bug
          Components: Rendering
    Affects Versions: 2.0.18
            Reporter: Lior Yaffe
         Attachments: linkedinceoresume.pdf

I'm trying to write a web service that performs OCR on input PDF files.

The approach is simple: convert the PDF to TIFF images using PDFBox, and then 
run Tesseract on the TIFF images to extract the text.

The code is straightforward:

 
{code:java}
private List<ByteArrayOutputStream> convertPdfToTiff2() throws IOException {
    List<ByteArrayOutputStream> fileList = new ArrayList<>();
    PDDocument doc = PDDocument.load(this.bytes);
    doc.setResourceCache(new EmptyCache());

    try {
        PDFRenderer pdfRenderer = new PDFRenderer(doc);
        for (int page = 0; page < doc.getNumberOfPages(); ++page) {
            // Render the current page at 300 DPI as an RGB image
            BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
            calcImageSize(bufferedImage);
            // Encode the page as TIFF into an in-memory buffer
            ByteArrayOutputStream os = new ByteArrayOutputStream();
            ImageIO.write(bufferedImage, "tiff", os);
            os.flush();
            os.close();
            // Release the raster before rendering the next page
            bufferedImage.flush();
            bufferedImage = null;
            fileList.add(os);
        }
    } finally {
        doc.close();
    }

    return fileList;
}
{code}
 

I'm trying to run a sample test that calls this method concurrently from 5-6 
threads, but the app crashes very quickly.
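
A concurrent test of this kind could look roughly like the sketch below. The class name ConcurrentRenderTest, the helper runConcurrently, and the stand-in task are hypothetical and not part of the reported code; in the real test the task would call convertPdfToTiff2().

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical harness: runs the same task once on each of N threads.
final class ConcurrentRenderTest {

    static <T> List<T> runConcurrently(int threads, Callable<T> task) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // One copy of the task per thread; invokeAll blocks until all finish
            List<Callable<T>> tasks = Collections.nCopies(threads, task);
            List<T> results = new ArrayList<>();
            for (Future<T> future : pool.invokeAll(tasks)) {
                // get() rethrows any failure from the task, wrapped in ExecutionException
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in task (assumption); replace with the PDF-to-TIFF conversion
        List<String> names = runConcurrently(6, () -> Thread.currentThread().getName());
        System.out.println("Completed runs: " + names.size());
    }
}
```

Failures inside a task surface when future.get() is called, so an OutOfMemoryError in one worker is reported rather than silently lost.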

 

I did some memory tests, and it seems that while the input file is around 70 
KB, the {{pdfRenderer}} object is around 300 MB! No matter how I change the 
DPI level, the object is still very large.
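
For scale, the size of a single rendered raster can be estimated from the page size and the DPI. The sketch below assumes a US Letter page (612 x 792 pt), which may not match the attached file, and 4 bytes per pixel, on the assumption that ImageType.RGB is backed by an int-per-pixel BufferedImage. The class and method names are hypothetical.

```java
// Hypothetical helper: estimates the in-memory size of one rendered page.
final class RasterSizeEstimate {

    // PDF user space uses 72 points per inch, so pixels = points * dpi / 72
    static long pixels(double points, int dpi) {
        return (long) Math.ceil(points * dpi / 72.0);
    }

    static long estimateBytes(double widthPt, double heightPt, int dpi, int bytesPerPixel) {
        return pixels(widthPt, dpi) * pixels(heightPt, dpi) * bytesPerPixel;
    }

    public static void main(String[] args) {
        // Assumed page size: US Letter, 612 x 792 pt; assumed 4 bytes per pixel
        long bytes = estimateBytes(612, 792, 300, 4);
        System.out.println("Estimated raster size in bytes: " + bytes);
    }
}
```

Under these assumptions one page at 300 DPI is 2550 x 3300 pixels, or 33,660,000 bytes (roughly 33.7 MB), before TIFF encoding.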

In addition, the memory usage drops only when I explicitly trigger a GC, even 
though I'm closing the doc object.
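
One way to take such before-and-after-GC readings is sketched below, using only java.lang.Runtime. The class HeapProbe and its methods are hypothetical, and System.gc() is only a request to the JVM, so the numbers are indicative rather than exact.

```java
import java.util.Locale;

// Hypothetical helper: reports the currently used heap.
final class HeapProbe {

    // Used heap = heap currently reserved by the JVM minus the free part of it
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    static String toMiB(long bytes) {
        return String.format(Locale.ROOT, "%.1f MiB", bytes / (1024.0 * 1024.0));
    }

    public static void main(String[] args) {
        System.out.println("Used before GC: " + toMiB(usedHeapBytes()));
        System.gc(); // a hint only; the JVM may ignore it
        System.out.println("Used after GC:  " + toMiB(usedHeapBytes()));
    }
}
```

A reading that falls only after System.gc() means the memory was unreachable but not yet collected, which is different from memory that is still referenced.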

 

Basically, when I run my server with -Xmx6g and 6 concurrent threads, the 
service crashes after 3 runs. What am I missing here?

 

* I attached the input PDF file.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

