Guyenot Jeremy created PDFBOX-1808:
--------------------------------------
Summary: PDFTextStripper.getText - hight memory usage
Key: PDFBOX-1808
URL: https://issues.apache.org/jira/browse/PDFBOX-1808
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.8.3, 1.8.2
Environment: Windows 7
Java jdk 1.7.0_45
Reporter: Guyenot Jeremy
Priority: Critical
Hello,
i'm trying to extract text from pdfs but i can find that the PDFTextStripper
use a lot of memory.
With a pdf that have 2676 pages (for a 4.6Mo size) it use 1.5Go memory.
I also constat that the memory is'nt free after the getText method is called.
You can see my code bellow:
double virgule = Math.pow(10, 2);
System.out.println("START - Total memory (Mo): " +
Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
PDDocument cd = PDDocument.load(file);
System.out.println("PDDocument getNumberOfPages - Nombre de
pages: " + cd.getNumberOfPages());
System.out.println("PDDocument load - Total memory (Mo): " +
Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
String pdfText = "";
try{
PDFTextStripper stripper = new PDFTextStripper();
pdfText = stripper.getText(cd);
System.out.println("PDFTextStripper getText - Total
memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) *
virgule) / virgule);
stripper.resetEngine();
stripper = null;
System.out.println("PDFTextStripper resetEngine - Total
memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) *
virgule) / virgule);
}
finally{
if( cd!=null ){
cd.close();
cd = null;
System.out.println("PDDocument close - Total
memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) *
virgule) / virgule);
}
}
retour = new TextField(fieldName, pdfText, Field.Store.NO);
System.out.println("TextField - Total memory (Mo): " +
Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
And the result into my output window:
START - Total memory (Mo): 95.0
PDDocument getNumberOfPages - Nombre de pages: 2676
PDDocument load - Total memory (Mo): 121.0
PDFTextStripper getText - Total memory (Mo): 757.0
PDFTextStripper resetEngine - Total memory (Mo): 757.0
PDDocument close - Total memory (Mo): 757.0
TextField - Total memory (Mo): 757.0
pdfText - Total memory (Mo): 757.0
I also try to call System.gc() but the memory use is the same.
--
This message was sent by Atlassian JIRA
(v6.1.4#6159)