[ 
https://issues.apache.org/jira/browse/PDFBOX-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013863#comment-17013863
 ] 

Lior Yaffe commented on PDFBOX-4739:
------------------------------------

I added a new method:
{code:java}
private void printUsedMemory(String text) {
    // Used heap = currently allocated heap minus the free part of it.
    long usedMemory = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
    long mb = usedMemory / 1000000;
    System.out.println(text + "....Used memory: " + mb + " MB");
}
{code}
and changed the previous code to:
{code:java}
private List<ByteArrayOutputStream> convertPdfToTiff() throws IOException {
    List<ByteArrayOutputStream> fileList = new ArrayList<>();
    PDDocument doc = PDDocument.load(this.bytes);
    doc.setResourceCache(new EmptyCache());

    try {
        PDFRenderer pdfRenderer = new PDFRenderer(doc);
        printUsedMemory("Before");
        for (int page = 0; page < doc.getNumberOfPages(); ++page) {
            BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
            calcImageSize(bufferedImage);
            ByteArrayOutputStream os = new ByteArrayOutputStream();
            ImageIO.write(bufferedImage, "tiff", os);
            os.flush();
            os.close();
            bufferedImage.flush();
            bufferedImage = null;
            fileList.add(os);
        }
        printUsedMemory("After");
    } finally {
        doc.close();
    }

    return fileList;
}
{code}
The output is:

Before....Used memory: 39 MB
After....Used memory: 385 MB
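
A note on reading these numbers: totalMemory() - freeMemory() also counts objects that are already unreachable but not yet collected, so the "After" value may overstate what is really retained. Below is a minimal sketch of a probe that reports the value before and after a GC request. The class and method names (UsedMemoryProbe, usedBytesAfterGcHint) are hypothetical, and System.gc() is only a hint that the JVM may ignore:

```java
class UsedMemoryProbe {

    // Heap currently in use, including garbage that has not been collected yet.
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Same measurement, taken after asking the JVM to collect garbage.
    // System.gc() is only a request; the JVM is free to ignore it.
    static long usedBytesAfterGcHint() {
        System.gc();
        return usedBytes();
    }

    // Decimal megabytes, matching the division by 1000000 in printUsedMemory.
    static long toMegabytes(long bytes) {
        return bytes / 1_000_000;
    }

    public static void main(String[] args) {
        // Allocate roughly 50 MB and drop the reference, so it becomes garbage.
        byte[][] garbage = new byte[50][];
        for (int i = 0; i < garbage.length; i++) {
            garbage[i] = new byte[1_000_000];
        }
        garbage = null;

        System.out.println("Before GC hint: " + toMegabytes(usedBytes()) + " MB");
        System.out.println("After GC hint:  " + toMegabytes(usedBytesAfterGcHint()) + " MB");
    }
}
```

If the second value is much lower than the first, most of the reported usage was collectable garbage rather than memory that is still referenced.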

With 3 concurrent threads, the output is:

Before....Used memory: 50 MB
Before....Used memory: 50 MB
Before....Used memory: 50 MB
After....Used memory: 952 MB
After....Used memory: 1082 MB
After....Used memory: 1107 MB
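
As a rough sanity check, the size of a single rendered page can be estimated from its dimensions. The sketch below assumes a US Letter page (8.5 x 11 in), which may not match the attached PDF. It also assumes 4 bytes per pixel for the in-memory image (ImageType.RGB is backed by an int-based BufferedImage, as far as I know) and 3 bytes per pixel for an RGB TIFF written without compression. The class and method names are made up for illustration:

```java
class RasterEstimate {

    // Approximate byte size of a raster for a page of the given physical size.
    // Pixel dimensions are rounded; a real renderer may round differently.
    static long rasterBytes(double widthInches, double heightInches, int dpi, int bytesPerPixel) {
        long widthPx = Math.round(widthInches * dpi);
        long heightPx = Math.round(heightInches * dpi);
        return widthPx * heightPx * bytesPerPixel;
    }

    public static void main(String[] args) {
        // Assumption: US Letter page rendered at 300 DPI.
        long inMemory = rasterBytes(8.5, 11.0, 300, 4);  // int-packed RGB BufferedImage
        long tiffBytes = rasterBytes(8.5, 11.0, 300, 3); // uncompressed 24-bit RGB TIFF data

        System.out.println("BufferedImage per page:     " + inMemory / 1_000_000 + " MB");
        System.out.println("Uncompressed TIFF per page: " + tiffBytes / 1_000_000 + " MB");
    }
}
```

Under these assumptions one page at 300 DPI is about 33 MB as a BufferedImage and about 25 MB as an uncompressed TIFF, so the ByteArrayOutputStream objects kept in fileList would grow with the page count and with the number of concurrent threads.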

> Memory issues when rendering pdf to image
> -----------------------------------------
>
>                 Key: PDFBOX-4739
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4739
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Rendering
>    Affects Versions: 2.0.18
>            Reporter: Lior Yaffe
>            Priority: Blocker
>         Attachments: linkedinceoresume.pdf
>
>
> I'm trying to write a web service that performs OCR on input PDF files.
> The code is very simple: convert the PDF to TIFF files using PDFBox, and 
> then use Tesseract on the TIFF files to get the text.
> The code is straightforward:
>  
> {code:java}
> private List<ByteArrayOutputStream> convertPdfToTiff2() throws IOException {
>     List<ByteArrayOutputStream> fileList = new ArrayList<>();
>     PDDocument doc = PDDocument.load(this.bytes);
>     doc.setResourceCache(new EmptyCache());
>     try {
>         PDFRenderer pdfRenderer = new PDFRenderer(doc);
>         for (int page = 0; page < doc.getNumberOfPages(); ++page) {
>             BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
>             calcImageSize(bufferedImage);
>             ByteArrayOutputStream os = new ByteArrayOutputStream();
>             ImageIO.write(bufferedImage, "tiff", os);
>             os.flush();
>             os.close();
>             bufferedImage.flush();
>             bufferedImage = null;
>             fileList.add(os);
>         }
>     } finally {
>         doc.close();
>     }
>     return fileList;
> }
> {code}
>  
> I'm trying to run a sample test that runs this concurrently on 5-6 different 
> threads, but the app crashes very quickly.
>  
> I did some memory tests, and it seems that while the input file is around 70 
> KB, the {{pdfRenderer}} object is around 300 MB. No matter how I change the 
> DPI level, the object is still very large.
> In addition, the memory drops only if I call the GC, even though I'm 
> closing the doc object.
>  
> Basically, when I run my server with -Xmx6GB and 6 concurrent threads, the 
> service crashes after 3 runs. What am I missing here?
>  
>  * I attached the input PDF file
>  


