[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

Guyenot Jeremy (JIRA) Sat, 14 Dec 2013 04:16:23 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848342#comment-13848342
 ]


Guyenot Jeremy commented on PDFBOX-1808:
----------------------------------------

Hello,

After more tests, I found some cases where memory leaks:
1) After extracting text from certain PDFs, the memory is not freed:
START - Total memory (Mo): 468.0
-- File : D:\Armoires\DEVEARM\mphh\image\167\83545\DOSSIER DE CANDIDATURE_001.pdf
-- File size (ko): 4975.0
----- PDDocument.load - Total memory (Mo): 468.0
----- PDDocument.getNumberOfPages : 2676
----- PDFTextStripper.getText - Total memory (Mo): 747.0
START - Total memory (Mo): 745.0
-- File : D:\Armoires\DEVEARM\mphh\image\167\83545\4 - EVALUATION ET BILANS\BILAN SOCIAL\Reprise adulte_001.pdf
-- File size (ko): 79.0
-- File size (Mo): 0.0
----- PDDocument.load - Total memory (Mo): 745.0
----- PDDocument.getNumberOfPages : 2
----- PDFTextStripper.getText - Total memory (Mo): 745.0

2) With certain other files, I get this:
START - Total memory (Mo): 268.0
-- File : D:\Armoires\DEVEARM\mphh\ocr\188\94458\4 - EVALUATION ET BILANS\ADMISSION URGENCE\Reprise adulte_003.pdf
-- File size (ko): 1183.0
-- File size (Mo): 1.0
déc. 14, 2013 12:55:43 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 2110 is wrong. Fall back to reading stream until 'endstream'.
déc. 14, 2013 12:55:43 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 1286 is wrong. Fall back to reading stream until 'endstream'.
déc. 14, 2013 12:55:43 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 706 is wrong. Fall back to reading stream until 'endstream'.
déc. 14, 2013 12:55:43 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 420 is wrong. Fall back to reading stream until 'endstream'.
déc. 14, 2013 12:55:43 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 936 is wrong. Fall back to reading stream until 'endstream'.
----- PDDocument.load - Total memory (Mo): 268.0
----- PDDocument.getNumberOfPages : 41
----- PDFTextStripper.getText - Total memory (Mo): 469.0
START - Total memory (Mo): 469.0
-- File : D:\Armoires\DEVEARM\mphh\image\167\83545\0 - INSTRUCTION\RECEVABILITE\AR Complet_001.pdf
-- File size (ko): 115.0
-- File size (Mo): 0.0
----- PDDocument.load - Total memory (Mo): 469.0
----- PDDocument.getNumberOfPages : 3
----- PDFTextStripper.getText - Total memory (Mo): 469.0

You can see that the memory is not freed after use.
I can't share my PDF files because they contain personal information.
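
One caveat about the figures above: Runtime.totalMemory() reports the heap the JVM has currently reserved, and that value can stay high even after objects have been collected, so on its own it may not show whether the extracted data was released. A minimal sketch of measuring used heap instead (total minus free) follows. The class name HeapUsage and its two helper methods are made up for illustration and are not part of PDFBox; the byte array is only a stand-in for the memory held during text extraction.

```java
public class HeapUsage {

    /** Bytes currently occupied by objects: reserved heap minus its free part. */
    public static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Converts bytes to megabytes (10^6 bytes), rounded to two decimals. */
    public static double toMegabytes(long bytes) {
        // Divide by a double so the fraction is not truncated by long division.
        return Math.round(bytes / 1_000_000.0 * 100) / 100.0;
    }

    public static void main(String[] args) {
        System.gc(); // only a hint; the JVM is free to ignore it
        System.out.println("Used before (MB): " + toMegabytes(usedBytes()));

        // Hypothetical stand-in for the memory held while extracting text.
        byte[] block = new byte[20_000_000];
        System.out.println("Allocated bytes: " + block.length);
        System.out.println("Used while held (MB): " + toMegabytes(usedBytes()));

        block = null; // drop the only reference
        System.gc();
        System.out.println("Used after release (MB): " + toMegabytes(usedBytes()));

        // For comparison: the reserved heap, which is what the logs above print.
        System.out.println("Total (MB): "
                + toMegabytes(Runtime.getRuntime().totalMemory()));
    }
}
```

If the "used" figure drops after the reference is cleared while the "total" figure stays flat, the heap has simply not been shrunk by the JVM; if "used" also stays high, something is probably still holding a reference.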

> PDFTextStripper.getText - hight memory usage
> --------------------------------------------
>
>                 Key: PDFBOX-1808
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1808
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.2, 1.8.3
>         Environment: Windows 7
> Java jdk 1.7.0_45
>            Reporter: Guyenot Jeremy
>            Priority: Critical
>              Labels: performance
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Hello,
> I'm trying to extract text from PDFs, but I find that PDFTextStripper uses
> a lot of memory.
> With a PDF of 2676 pages (4.6 MB in size), it uses 1.5 GB of memory.
> I also notice that the memory isn't freed after the getText method is called.
> You can see my code below:
> double virgule = Math.pow(10, 2);
> System.out.println("START - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> PDDocument cd = PDDocument.load(file);
> System.out.println("PDDocument getNumberOfPages - Nombre de pages: " + cd.getNumberOfPages());
> System.out.println("PDDocument load - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> String pdfText = "";
> try {
>     PDFTextStripper stripper = new PDFTextStripper();
>     pdfText = stripper.getText(cd);
>     System.out.println("PDFTextStripper getText - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
>     stripper.resetEngine();
>     stripper = null;
>     System.out.println("PDFTextStripper resetEngine - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> }
> finally {
>     if (cd != null) {
>         cd.close();
>         cd = null;
>         System.out.println("PDDocument close - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
>     }
> }
> retour = new TextField(fieldName, pdfText, Field.Store.NO);
> System.out.println("TextField - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> And the result in my output window:
> START - Total memory (Mo): 95.0
> PDDocument getNumberOfPages - Nombre de pages: 2676
> PDDocument load - Total memory (Mo): 121.0
> PDFTextStripper getText - Total memory (Mo): 757.0
> PDFTextStripper resetEngine - Total memory (Mo): 757.0
> PDDocument close - Total memory (Mo): 757.0
> TextField - Total memory (Mo): 757.0
> pdfText - Total memory (Mo): 757.0
> I also tried calling System.gc(), but the memory usage stays the same.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
