[
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848464#comment-13848464
]
Tilman Hausherr edited comment on PDFBOX-1808 at 12/14/13 9:34 PM:
-------------------------------------------------------------------
I ran a test with the PDF of the PDF specification: I stripped it 5 times,
freed everything, and then just idled.
START - Total memory (Mo): 128.0
strip size: 2595883
PDDocument close - Total memory (Mo): 947.0
strip size: 2595883
PDDocument close - Total memory (Mo): 1093.0
strip size: 2595883
PDDocument close - Total memory (Mo): 1091.0
strip size: 2595883
PDDocument close - Total memory (Mo): 1169.0
strip size: 2595883
PDDocument close - Total memory (Mo): 1192.0
After sleep - Total memory (Mo): 1192.0
After sleep - Total memory (Mo): 1192.0
After sleep - Total memory (Mo): 1192.0
After sleep - Total memory (Mo): 1192.0
After sleep - Total memory (Mo): 1192.0
Yes, that's a lot. I looked at it with the profiler, and I believe it's all
static objects. I took screenshots of the PDFBox object counts, then did a run
with 50 strips. I compared the number of org.apache.* objects after doing a GC,
and it's the same.
That Java itself is using more and more memory has to do with poor memory
management by the Java runtime.
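The figures above come from Runtime.totalMemory(), which reports the heap the JVM has currently reserved, not the space occupied by objects. To tell the two apart, one could log the used heap (total minus free) next to it. The following is only a minimal sketch and does not touch PDFBox; the class and method names (HeapSnapshot, usedBytes, describe) and the 50 MB test allocation are made up for illustration:

```java
import java.util.Locale;

public class HeapSnapshot {

    /** Heap currently occupied by objects: reserved heap minus its free part. */
    public static long usedBytes(long totalBytes, long freeBytes) {
        return totalBytes - freeBytes;
    }

    /** One log line with reserved, used, and maximum heap in MB (1 MB = 10^6 bytes). */
    public static String describe(String label, Runtime rt) {
        long total = rt.totalMemory(); // heap reserved by the JVM right now
        long free = rt.freeMemory();   // unused part of that reserved heap
        long max = rt.maxMemory();     // upper limit the heap may grow to
        return String.format(Locale.ROOT,
                "%s - reserved: %.1f MB, used: %.1f MB, max: %.1f MB",
                label, total / 1e6, usedBytes(total, free) / 1e6, max / 1e6);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println(describe("START", rt));

        byte[] junk = new byte[50000000]; // hypothetical workload standing in for a strip
        System.out.println(describe("after allocation (" + junk.length + " bytes)", rt));

        junk = null;
        System.gc(); // only a hint; the JVM may or may not collect immediately
        System.out.println(describe("after gc", rt));
    }
}
```

If the "used" value drops back after a GC while "reserved" stays high, the objects were released and the JVM is simply holding on to heap it has already claimed.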
> PDFTextStripper.getText - hight memory usage
> --------------------------------------------
>
> Key: PDFBOX-1808
> URL: https://issues.apache.org/jira/browse/PDFBOX-1808
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.2, 1.8.3
> Environment: Windows 7
> Java jdk 1.7.0_45
> Reporter: Guyenot Jeremy
> Priority: Critical
> Labels: performance
> Attachments: DOSSIER DE CANDIDATURE_001.pdf, s5-1.png, s5-2.png,
> s50-1.png, s50-2.png
>
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> Hello,
> I'm trying to extract text from PDFs, but I find that PDFTextStripper
> uses a lot of memory.
> With a PDF that has 2676 pages (4.6 MB in size), it uses 1.5 GB of memory.
> I also noticed that the memory isn't freed after the getText method is called.
> You can see my code below:
> double virgule = Math.pow(10, 2);
> System.out.println("START - Total memory (Mo): "
>     + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> PDDocument cd = PDDocument.load(file);
> System.out.println("PDDocument getNumberOfPages - Nombre de pages: "
>     + cd.getNumberOfPages());
> System.out.println("PDDocument load - Total memory (Mo): "
>     + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> String pdfText = "";
> try {
>     PDFTextStripper stripper = new PDFTextStripper();
>     pdfText = stripper.getText(cd);
>     System.out.println("PDFTextStripper getText - Total memory (Mo): "
>         + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
>     stripper.resetEngine();
>     stripper = null;
>     System.out.println("PDFTextStripper resetEngine - Total memory (Mo): "
>         + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> }
> finally {
>     if (cd != null) {
>         cd.close();
>         cd = null;
>         System.out.println("PDDocument close - Total memory (Mo): "
>             + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
>     }
> }
> retour = new TextField(fieldName, pdfText, Field.Store.NO);
> System.out.println("TextField - Total memory (Mo): "
>     + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> And the result in my output window:
> START - Total memory (Mo): 95.0
> PDDocument getNumberOfPages - Nombre de pages: 2676
> PDDocument load - Total memory (Mo): 121.0
> PDFTextStripper getText - Total memory (Mo): 757.0
> PDFTextStripper resetEngine - Total memory (Mo): 757.0
> PDDocument close - Total memory (Mo): 757.0
> TextField - Total memory (Mo): 757.0
> pdfText - Total memory (Mo): 757.0
> I also tried calling System.gc(), but the memory usage stays the same.
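A side note on the logging expression in the quoted code: totalMemory()/1000000 is a long divided by an int, so the fraction is dropped before the two-decimal rounding is applied. That would explain why every logged value ends in ".0". The sketch below compares that expression with a floating-point variant; the class name, method names, and the sample byte count are made up for illustration:

```java
public class MegabyteRounding {

    /** Same expression as in the quoted code: integer division happens first. */
    public static double truncatedMb(long bytes) {
        double virgule = Math.pow(10, 2);
        return Math.round((bytes / 1000000) * virgule) / virgule;
    }

    /** Variant dividing by a double, so the two-decimal rounding has something to round. */
    public static double roundedMb(long bytes) {
        double virgule = Math.pow(10, 2);
        return Math.round((bytes / 1000000.0) * virgule) / virgule;
    }

    public static void main(String[] args) {
        long sample = 757123456L; // hypothetical byte count, not a measured value
        System.out.println(truncatedMb(sample)); // prints 757.0
        System.out.println(roundedMb(sample));   // prints 757.12
    }
}
```

This only affects the precision of the log output; it does not change what totalMemory() measures.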