[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

Tilman Hausherr (JIRA) Mon, 16 Dec 2013 09:19:51 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849353#comment-13849353
 ]


Tilman Hausherr commented on PDFBOX-1808:
-----------------------------------------

Regarding your comment after trying clearResources(): you didn't clean up the 
stripper itself and the other objects, as you did in your test program. Also, 
you should look at it with the profiler, as you did last time.
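
As a quick check short of a full profiler run, the sketch below reports used 
heap next to total heap. Runtime.totalMemory() is the heap the JVM currently 
has reserved, and it does not necessarily shrink after a garbage collection, so 
a constant total does not by itself show that objects are still referenced. 
The class name HeapCheck and its helper methods are hypothetical and not part 
of PDFBox; the buffer allocation only stands in for a memory-hungry call such 
as getText().

```java
import java.util.ArrayList;
import java.util.List;

public class HeapCheck {
    /** Heap currently occupied by objects (live or not yet collected), in bytes. */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Converts bytes to megabytes, rounded to two decimals (floating-point division). */
    static double toMb(long bytes) {
        return Math.round(bytes / 1_000_000.0 * 100) / 100.0;
    }

    /** Prints used and total heap side by side for one measurement point. */
    static void report(String label) {
        long total = Runtime.getRuntime().totalMemory();
        System.out.println(label + " - used (MB): " + toMb(usedBytes())
                + ", total (MB): " + toMb(total));
    }

    public static void main(String[] args) {
        report("START");

        // Stand-in for the real workload: roughly 40 MB of char arrays.
        List<char[]> buffers = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            buffers.add(new char[1_000_000]);
        }
        report("after allocation");

        // Drop the only reference, then request (not force) a collection.
        buffers = null;
        System.gc();
        report("after gc");
    }
}
```

If the used figure falls back after the references are dropped while the total 
stays put, the memory has been released to the JVM even though the process 
footprint is unchanged. A heap snapshot in the profiler remains the reliable 
way to see which objects are still reachable.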

> PDFTextStripper.getText - hight memory usage
> --------------------------------------------
>
>                 Key: PDFBOX-1808
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1808
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.2, 1.8.3
>         Environment: Windows 7
> Java jdk 1.7.0_45
>            Reporter: Guyenot Jeremy
>            Priority: Critical
>              Labels: performance
>         Attachments: 1808-java char copyof.jpg, 1808-java char 
> copyofrange.jpg, 1808-java usage.jpg, 1808-pdfbox usage.jpg, 
> 1808-snapshot.nps, DOSSIER DE CANDIDATURE_001.pdf, s5-1.png, s5-2.png, 
> s50-1.png, s50-2.png
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Hello,
> I'm trying to extract text from PDFs, but I find that PDFTextStripper uses a 
> lot of memory.
> With a PDF of 2676 pages (4.6 MB in size), it uses 1.5 GB of memory.
> I also notice that the memory isn't freed after the getText method is called.
> You can see my code below:
> double virgule = Math.pow(10, 2);
> System.out.println("START - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> PDDocument cd = PDDocument.load(file);
> System.out.println("PDDocument getNumberOfPages - Nombre de pages: " + cd.getNumberOfPages());
> System.out.println("PDDocument load - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> String pdfText = "";
> try {
>     PDFTextStripper stripper = new PDFTextStripper();
>     pdfText = stripper.getText(cd);
>     System.out.println("PDFTextStripper getText - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
>     stripper.resetEngine();
>     stripper = null;
>     System.out.println("PDFTextStripper resetEngine - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> }
> finally {
>     if (cd != null) {
>         cd.close();
>         cd = null;
>         System.out.println("PDDocument close - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
>     }
> }
> retour = new TextField(fieldName, pdfText, Field.Store.NO);
> System.out.println("TextField - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> And the result in my output window:
> START - Total memory (Mo): 95.0
> PDDocument getNumberOfPages - Nombre de pages: 2676
> PDDocument load - Total memory (Mo): 121.0
> PDFTextStripper getText - Total memory (Mo): 757.0
> PDFTextStripper resetEngine - Total memory (Mo): 757.0
> PDDocument close - Total memory (Mo): 757.0
> TextField - Total memory (Mo): 757.0
> pdfText - Total memory (Mo): 757.0
> I also tried calling System.gc(), but the memory usage stays the same.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
