Guyenot Jeremy created PDFBOX-1808:
--------------------------------------

             Summary: PDFTextStripper.getText - high memory usage
                 Key: PDFBOX-1808
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1808
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.8.3, 1.8.2
         Environment: Windows 7
Java jdk 1.7.0_45
            Reporter: Guyenot Jeremy
            Priority: Critical


Hello,

I'm trying to extract text from PDFs, but I find that PDFTextStripper uses
a lot of memory.
With a PDF that has 2676 pages (4.6 MB in size), it uses 1.5 GB of memory.
I have also observed that the memory isn't freed after the getText method
is called.

You can see my code below:
double virgule = Math.pow(10, 2);
System.out.println("START - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
PDDocument cd = PDDocument.load(file);
System.out.println("PDDocument getNumberOfPages - Nombre de pages: " + cd.getNumberOfPages());
System.out.println("PDDocument load - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
String pdfText = "";
try {
        PDFTextStripper stripper = new PDFTextStripper();
        pdfText = stripper.getText(cd);
        System.out.println("PDFTextStripper getText - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
        stripper.resetEngine();
        stripper = null;
        System.out.println("PDFTextStripper resetEngine - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
}
finally {
        if (cd != null) {
                cd.close();
                cd = null;
                System.out.println("PDDocument close - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
        }
}
retour = new TextField(fieldName, pdfText, Field.Store.NO);
System.out.println("TextField - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
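
A note on the arithmetic in the logging lines above: totalMemory()/1000000 divides one integer value by another, so the fraction is dropped before the rounding with virgule is applied, and every reported figure ends in .0. A minimal sketch of a helper that keeps two decimals follows. The names MemoryFormat and toMegabytes are hypothetical and are not part of PDFBox.

```java
public class MemoryFormat {
    // Hypothetical helper: converts a byte count to megabytes (10^6 bytes),
    // rounded to two decimal places.
    public static double toMegabytes(long bytes) {
        // Dividing by a double literal keeps the fractional part.
        double megabytes = bytes / 1000000.0;
        return Math.round(megabytes * 100.0) / 100.0;
    }

    public static void main(String[] args) {
        long total = Runtime.getRuntime().totalMemory();
        System.out.println("Total memory (MB): " + toMegabytes(total));
    }
}
```

Used in place of the inline expression, this would report fractional megabytes without changing what is being measured.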


And the result in my output window:
START - Total memory (Mo): 95.0
PDDocument getNumberOfPages - Nombre de pages: 2676
PDDocument load - Total memory (Mo): 121.0
PDFTextStripper getText - Total memory (Mo): 757.0
PDFTextStripper resetEngine - Total memory (Mo): 757.0
PDDocument close - Total memory (Mo): 757.0
TextField - Total memory (Mo): 757.0
pdfText - Total memory (Mo): 757.0

I also tried calling System.gc(), but the memory usage stayed the same.
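
One caveat on the figures above: Runtime.totalMemory() reports the heap the JVM has currently reserved, not the part of it occupied by live objects, and the reserved heap may not shrink after a collection. A minimal sketch that measures the heap in use instead is shown below. HeapProbe and usedHeapBytes are hypothetical names, the byte array is only a stand-in for a large extraction result, and System.gc() is a request that the JVM is free to ignore.

```java
public class HeapProbe {
    // Hypothetical helper: heap currently occupied by objects, in bytes
    // (reserved heap minus the free part of it).
    public static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        // Stand-in for a large result such as the extracted text (assumption).
        byte[] block = new byte[50 * 1000 * 1000];
        long before = usedHeapBytes();

        block = null;   // drop the only reference
        System.gc();    // request (not force) a collection
        long after = usedHeapBytes();

        System.out.println("Used before release (MB): " + before / 1000000);
        System.out.println("Used after release (MB): " + after / 1000000);
        System.out.println("Reserved heap (MB): "
                + Runtime.getRuntime().totalMemory() / 1000000);
    }
}
```

Logging this value after getText and again after cd.close() would help distinguish memory that is still referenced from heap that the JVM simply has not returned.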



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
