
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2014-02-06 Thread Guyenot Jeremy (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13893517#comment-13893517
 ] 

Guyenot Jeremy commented on PDFBOX-1808:


When I check with the Windows Task Manager, I see the same result.
The memory is only released after a long while.

I tried with 1.8.4.SNAPSHOT.

 PDFTextStripper.getText - hight memory usage
 

                 Key: PDFBOX-1808
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1808
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.8.2, 1.8.3
         Environment: Windows 7
                      Java jdk 1.7.0_45
            Reporter: Guyenot Jeremy
            Assignee: Andreas Lehmkühler
            Priority: Critical
              Labels: performance
             Fix For: 1.8.4, 2.0.0

         Attachments: 1808-java char copyof.jpg, 1808-java char copyofrange.jpg,
                      1808-java usage.jpg, 1808-pdfbox usage.jpg,
                      1808-snapshot.nps, DOSSIER DE CANDIDATURE_001.pdf,
                      Screenshot2014-01-21-19-51-24.png, netbeans_project.jpg,
                      s5-1.png, s5-2.png, s50-1.png, s50-2.png

   Original Estimate: 72h
  Remaining Estimate: 72h

 Hello,
 I'm trying to extract text from PDFs, but I find that PDFTextStripper uses a
 lot of memory.
 With a PDF that has 2676 pages (4.6 MB in size) it uses 1.5 GB of memory.
 I also noticed that the memory isn't freed after the getText method has been
 called.
 You can see my code below:
 double virgule = Math.pow(10, 2);
 // totalMemory() returns bytes; the divisor 1000000 (bytes to Mo) is an
 // assumption inferred from the values in the output further down.
 System.out.println("START - Total memory (Mo): "
     + Math.round((Runtime.getRuntime().totalMemory() / 1000000) * virgule) / virgule);
 PDDocument cd = PDDocument.load(file);
 System.out.println("PDDocument getNumberOfPages - Nombre de pages: "
     + cd.getNumberOfPages());
 System.out.println("PDDocument load - Total memory (Mo): "
     + Math.round((Runtime.getRuntime().totalMemory() / 1000000) * virgule) / virgule);
 String pdfText = "";
 try {
     PDFTextStripper stripper = new PDFTextStripper();
     pdfText = stripper.getText(cd);
     System.out.println("PDFTextStripper getText - Total memory (Mo): "
         + Math.round((Runtime.getRuntime().totalMemory() / 1000000) * virgule) / virgule);
     stripper.resetEngine();
     stripper = null;
     System.out.println("PDFTextStripper resetEngine - Total memory (Mo): "
         + Math.round((Runtime.getRuntime().totalMemory() / 1000000) * virgule) / virgule);
 }
 finally {
     // Always close the document, even if text extraction fails.
     if (cd != null) {
         cd.close();
         cd = null;
         System.out.println("PDDocument close - Total memory (Mo): "
             + Math.round((Runtime.getRuntime().totalMemory() / 1000000) * virgule) / virgule);
     }
 }
 retour = new TextField(fieldName, pdfText, Field.Store.NO);
 System.out.println("TextField - Total memory (Mo): "
     + Math.round((Runtime.getRuntime().totalMemory() / 1000000) * virgule) / virgule);
 And the result in my output window:
 START - Total memory (Mo): 95.0
 PDDocument getNumberOfPages - Nombre de pages: 2676
 PDDocument load - Total memory (Mo): 121.0
 PDFTextStripper getText - Total memory (Mo): 757.0
 PDFTextStripper resetEngine - Total memory (Mo): 757.0
 PDDocument close - Total memory (Mo): 757.0
 TextField - Total memory (Mo): 757.0
 pdfText - Total memory (Mo): 757.0
 I also tried calling System.gc(), but the memory usage stays the same.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2014-02-06 Thread John Hewson (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13893672#comment-13893672
 ] 

John Hewson commented on PDFBOX-1808:
-

[~jguyenot], perhaps the JVM settings in http://stackoverflow.com/a/4625142 
will help you.


[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2014-01-22 Thread Timo Boehme (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13878461#comment-13878461
 ] 

Timo Boehme commented on PDFBOX-1808:
-

[~jguyenot] please read up on the meaning of the memory statistics provided by 
Java. *Total memory* is (as the name says) all the memory the VM currently 
uses. What you want is the memory used by your application. This has to be 
calculated as totalMem - freeMem (see e.g. 
http://stackoverflow.com/questions/3571203/what-is-the-exact-meaning-of-runtime-getruntime-totalmemory-and-freememory).
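A minimal sketch of the calculation described above, using only the standard
java.lang.Runtime API. The class name HeapReport and its methods are
hypothetical helpers made up for illustration; they are not part of PDFBox or
of the reporter's code.

```java
import java.util.Locale;

/** Illustrative helper; the class and method names are assumptions, not PDFBox API. */
final class HeapReport {

    private static final double BYTES_PER_MB = 1024.0 * 1024.0;

    /** Heap currently reserved by the JVM, in bytes (the figure printed in the report). */
    static long reservedBytes() {
        return Runtime.getRuntime().totalMemory();
    }

    /** Heap occupied by objects, in bytes: totalMemory() - freeMemory(). */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Formats both figures so they can be compared at each step. */
    static String describe(String label) {
        return String.format(Locale.ROOT, "%s - used: %.2f MB, reserved: %.2f MB",
                label, usedBytes() / BYTES_PER_MB, reservedBytes() / BYTES_PER_MB);
    }

    public static void main(String[] args) {
        System.out.println(describe("START"));
        byte[] block = new byte[32 * 1024 * 1024]; // stand-in for real work
        System.out.println(describe("after allocating " + block.length + " bytes"));
    }
}
```

Printing describe(...) at the same points as in the report (after load, after
getText, after close) would show whether the used figure falls again while the
reserved figure stays high.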


[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2014-01-22 Thread Timo Boehme (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13878466#comment-13878466
 ] 

Timo Boehme commented on PDFBOX-1808:
-

One addition to my last comment: whether the JVM releases memory to the 
operating system when a large part of it is free depends on the JVM 
implementation. Server VMs typically keep the allocated memory, regardless of 
whether the Java application still needs it.
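One way to observe this from inside the application is the standard
java.lang.management API, which reports the committed heap (what the JVM holds
from the operating system) separately from the used heap. The sketch below is
only an illustration; the class name CommittedHeap and its methods are
hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Illustrative helper; the class and method names are assumptions. */
final class CommittedHeap {

    /** Snapshot of the heap figures as the JVM reports them. */
    static MemoryUsage snapshot() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    }

    /** Bytes the JVM holds from the OS that are not currently occupied by objects. */
    static long idleBytes(MemoryUsage heap) {
        return heap.getCommitted() - heap.getUsed();
    }

    public static void main(String[] args) {
        MemoryUsage heap = snapshot();
        System.out.println("used      = " + heap.getUsed());
        System.out.println("committed = " + heap.getCommitted());
        System.out.println("idle      = " + idleBytes(heap));
    }
}
```

A large idle value after the document has been closed would match the
behaviour described here: the memory is free for the application but has not
been handed back to the operating system.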


[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2014-01-20 Thread Guyenot Jeremy (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876343#comment-13876343
 ] 

Guyenot Jeremy commented on PDFBOX-1808:


Hello,

My NetBeans project is set up as shown in netbeans_project.jpg.
How can I integrate the sources from 
http://svn.apache.org/repos/asf/pdfbox/trunk/ in place of 
pdfbox-app-1.8.3.jar?


[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2014-01-20 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876401#comment-13876401
 ] 

Tilman Hausherr commented on PDFBOX-1808:
-

With NetBeans, go to Team > Subversion > Checkout.

Enter 
http://svn.apache.org/repos/asf/pdfbox/trunk
and click Next. Replace
pdfbox/trunk with pdfbox/branches/1.8
and click Finish.

...wait some time...

NetBeans will ask you to create a project; do so. At the next question, click 
on "pdfbox reactor" only. It will create a project named "pdfbox reactor". 
Then do a build on that one.

The jar files you need will be in the xxx\target directories, or all together 
in 
1.8\war\target\pdfbox-war-2.0.0-SNAPSHOT\WEB-INF\lib


If you can't do it at work because of a firewall, do it from home.


[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2014-01-20 Thread Guyenot Jeremy (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876438#comment-13876438
 ] 

Guyenot Jeremy commented on PDFBOX-1808:


Thanks Tilman Hausherr,

I created a project and compiled it.
I added the new jar files to my library folder and ran my test.

This is the result:
# File : DOSSIER DE CANDIDATURE_001.pdf
# START - Total memory (Mo): 167.0
# PDDocument getNumberOfPages - Nombre de pages: 2676
# PDDocument load - Total memory (Mo): 167.0
# PDFTextStripper getText - Total memory (Mo): 706.0
# PDDocument close - Total memory (Mo): 706.0

You can see that the memory is not released after processing.


[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2014-01-15 Thread Andreas Lehmkühler (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873111#comment-13873111
 ] 

Andreas Lehmkühler commented on PDFBOX-1808:


I added most of the changes (excluding 1553174, as it introduces an API 
incompatibility) to the 1.8 branch in revision 1558705.

[~jguyenot] Any luck with your test? Do you need some additional 
help/information?


[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2014-01-14 Thread Guyenot Jeremy (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870503#comment-13870503
 ] 

Guyenot Jeremy commented on PDFBOX-1808:


Hello,
thank you for your response.
I am trying to integrate your changes into my project, but I use the .jar.
Is it possible to get a jar build that contains all the changes, so that I can 
add it to my NetBeans project?


[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2014-01-14 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870556#comment-13870556
 ] 

Tilman Hausherr commented on PDFBOX-1808:
-

Can you access the svn repository from where you are now, or is your firewall 
preventing you from doing so? The URL is 
http://svn.apache.org/repos/asf/pdfbox/trunk/
(see http://pdfbox.apache.org/downloads.html#scm )



[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2014-01-14 Thread Andreas Lehmkühler (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870566#comment-13870566
 ] 

Andreas Lehmkühler commented on PDFBOX-1808:


[~jguyenot] There is no official release containing those changes, and we don't 
have any plans to release one yet. If you are using Maven, you can reference the 
SNAPSHOT versions (1.8.4-SNAPSHOT or 2.0.0-SNAPSHOT) of PDFBox or simply 
download the latest SNAPSHOT build from 
[nexus|https://repository.apache.org/index.html]. But be aware that these are 
unstable builds.


[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2014-01-11 Thread Andreas Lehmkühler (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868761#comment-13868761
 ] 

Andreas Lehmkühler commented on PDFBOX-1808:


I added a null check to PDDocument#close in revision 1557374 to avoid an NPE 
when splitting a PDF.
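
For readers unfamiliar with the pattern, here is a minimal sketch of a null-safe, 
repeatable close. It is not the code from revision 1557374; the class 
GuardedDocument and its StringBuilder field are hypothetical stand-ins for a 
document object whose internal resource may never have been set.

```java
// Hypothetical stand-in for a document whose internal resource may be null.
final class GuardedDocument {
    // May be null, e.g. when the document was created without any content.
    private StringBuilder buffer;

    GuardedDocument(StringBuilder buffer) {
        this.buffer = buffer;
    }

    boolean isOpen() {
        return buffer != null;
    }

    // Safe to call when the resource is missing and safe to call twice.
    void close() {
        if (buffer != null) {
            buffer.setLength(0); // release the content
            buffer = null;       // drop the reference so the GC can reclaim it
        }
    }

    public static void main(String[] args) {
        GuardedDocument empty = new GuardedDocument(null);
        empty.close(); // no NullPointerException

        GuardedDocument full = new GuardedDocument(new StringBuilder("text"));
        full.close();
        full.close(); // the second call is a no-op
        System.out.println("open after close: " + full.isOpen());
    }
}
```

Without the guard, calling close() on a document that never received its 
resource would fail with a NullPointerException.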


[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2014-01-04 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862257#comment-13862257
 ] 

Tilman Hausherr commented on PDFBOX-1808:


Let's just wait until Jeremy is back from vacation. My own tests show higher 
usage (?!), but I don't trust Java to tell the truth, and there have been other 
changes since then. I didn't save my own .nps memory snapshot, so I can't 
compare. Your code changes looked like they should have improved things.

START - Total memory (Mo): 128.0
strip size: 2595883
PDDocument close - Total memory (Mo): 926.0
strip size: 2595883
PDDocument close - Total memory (Mo): 1026.0
strip size: 2595883
PDDocument close - Total memory (Mo): 1027.0
strip size: 2595883
PDDocument close - Total memory (Mo): 1021.0
strip size: 2595883
PDDocument close - Total memory (Mo): 1171.0
After sleep - Total memory (Mo): 1174.0
After sleep - Total memory (Mo): 1174.0
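
A note on reading these numbers: Runtime.totalMemory() reports the heap the JVM 
currently has reserved, not the part occupied by objects, and the JVM does not 
necessarily give that memory back after a collection. Below is a minimal sketch 
that logs both figures. The class name HeapReport and the divisor of 1,000,000 
(bytes to megabytes) are assumptions, not taken from the thread.

```java
// Sketch: print used heap next to the heap size the JVM has reserved.
final class HeapReport {
    // Assumption: decimal megabytes, matching the "Mo" labels in the thread.
    private static final long BYTES_PER_MB = 1_000_000L;

    // Heap currently reserved by the JVM; this is what totalMemory() returns.
    static long committedMb() {
        return Runtime.getRuntime().totalMemory() / BYTES_PER_MB;
    }

    // Approximate heap occupied by objects, including garbage not yet collected.
    static long usedMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / BYTES_PER_MB;
    }

    static void print(String label) {
        System.out.println(label + " - used (MB): " + usedMb()
                + ", reserved (MB): " + committedMb());
    }

    public static void main(String[] args) {
        print("START");
        byte[][] blocks = new byte[32][];
        for (int i = 0; i < blocks.length; i++) {
            blocks[i] = new byte[1_000_000];
        }
        print("After allocating");
        blocks = null;
        System.gc(); // only a request; the JVM may ignore it
        print("After gc");
    }
}
```

If the used figure drops after the GC request while the reserved figure stays 
flat, the objects were released and only the heap size remained high.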



[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2014-01-04 Thread Andreas Lehmkühler (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862278#comment-13862278
 ] 

Andreas Lehmkühler commented on PDFBOX-1808:


I'm using Java VisualVM (it's part of the JDK) as a profiler. It has a lot of 
monitoring features, e.g. one can see all live objects, which makes it easy to 
see whether they can be finalized or not.

In my environment all objects were released, at least after triggering the GC.
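
The same check can be approximated from code without a profiler. The following 
is a sketch using the standard java.lang.management API; the class name 
HeapSnapshot is an assumption, and the GC call is only a request to the JVM, so 
the figures may differ from what a profiler shows.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Sketch: read used and committed heap through the platform MemoryMXBean.
final class HeapSnapshot {
    static MemoryUsage heap() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        return bean.getHeapMemoryUsage();
    }

    static String describe(MemoryUsage usage) {
        // getMax() may be -1 when no maximum is defined.
        return "used=" + usage.getUsed()
                + " committed=" + usage.getCommitted()
                + " max=" + usage.getMax();
    }

    public static void main(String[] args) {
        System.out.println("before gc: " + describe(heap()));
        ManagementFactory.getMemoryMXBean().gc(); // same effect as System.gc()
        System.out.println("after gc:  " + describe(heap()));
    }
}
```

Here "used" is the value to watch for released objects; "committed" corresponds 
to the total memory figure printed in the issue description.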


[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2014-01-03 Thread Andreas Lehmkühler (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861759#comment-13861759
 ] 

Andreas Lehmkühler commented on PDFBOX-1808:


Can anybody confirm that my changes are working well? I'd like to add them to 
the 1.8 branch as well.


[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2013-12-23 Thread Andreas Lehmkühler (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855838#comment-13855838
 ] 

Andreas Lehmkühler commented on PDFBOX-1808:


In revision 1553174 I removed some cached values that are only used once. 
In revision 1553175 I added some code to release used resources when they are 
no longer needed.


[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2013-12-23 Thread Guyenot Jeremy (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855891#comment-13855891
 ] 

Guyenot Jeremy commented on PDFBOX-1808:


Hello,

thank you for your work on this problem.

I am currently on vacation, but when I return to work I will test your changes 
and get back to you.

Have a great holiday.


[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2013-12-23 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855917#comment-13855917
 ] 

Tilman Hausherr commented on PDFBOX-1808:


@Andreas: it no longer builds. I believe the problem is in 
testCreateEmptyPdf(): pdfDoc.close() is called too early.


[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2013-12-23 Thread Andreas Lehmkühler (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855947#comment-13855947
 ] 

Andreas Lehmkühler commented on PDFBOX-1808:


Yes, I missed that, sorry, my fault. I'll have a look.


[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2013-12-23 Thread Andreas Lehmkühler (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855952#comment-13855952
 ] 

Andreas Lehmkühler commented on PDFBOX-1808:


I fixed the test in revision 1553220. Thanks for the pointer!


[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2013-12-22 Thread Andreas Lehmkühler (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855238#comment-13855238
 ] 

Andreas Lehmkühler commented on PDFBOX-1808:


PDFBOX-1777 should already release some of the resources at the end of parsing.


[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2013-12-22 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855253#comment-13855253
 ] 

Tilman Hausherr commented on PDFBOX-1808:

I repeated the test from the comment of 14/Dec/13 22:29; the output is slightly better:

START - Total memory (Mo): 128.0
strip size: 2595883
PDDocument close - Total memory (Mo): 903.0
strip size: 2595883
PDDocument close - Total memory (Mo): 914.0
strip size: 2595883
PDDocument close - Total memory (Mo): 1020.0
strip size: 2595883
PDDocument close - Total memory (Mo): 1070.0
strip size: 2595883
PDDocument close - Total memory (Mo): 1074.0
After sleep - Total memory (Mo): 1076.0
After sleep - Total memory (Mo): 1076.0
After sleep - Total memory (Mo): 1076.0
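
A note on reading these figures: Runtime.totalMemory() reports the heap the JVM currently has reserved, not the amount occupied by objects, and the JVM does not necessarily hand that reservation back right after a garbage collection. A minimal sketch of a helper that prints both values is shown below; the class and method names (HeapReport, totalMb, usedMb, report) are made up for illustration and are not part of PDFBox or of the test program discussed in this thread.

```java
public class HeapReport {
    private static final long MB = 1024L * 1024L;

    /** Heap currently reserved by the JVM, in MB (what totalMemory() reports). */
    static long totalMb() {
        return Runtime.getRuntime().totalMemory() / MB;
    }

    /** Part of the reserved heap occupied by objects, in MB (total minus free). */
    static long usedMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MB;
    }

    /** One log line in the style used in this thread, with both figures. */
    static String report(String label) {
        return label + " - total (MB): " + totalMb() + ", used (MB): " + usedMb();
    }

    public static void main(String[] args) {
        System.out.println(report("START"));

        // Allocate roughly 32 MB so that the "used" figure visibly changes.
        byte[][] blocks = new byte[32][];
        for (int i = 0; i < blocks.length; i++) {
            blocks[i] = new byte[(int) MB];
        }
        System.out.println(report("After allocation"));

        // Drop the only reference; System.gc() is a hint, not a guarantee.
        blocks = null;
        System.gc();
        System.out.println(report("After gc"));
    }
}
```

If the "used" figure drops after the stripper and document are released while "total" stays high, the heap has only been kept reserved; if "used" stays high as well, something is still holding references.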



[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2013-12-22 Thread Andreas Lehmkühler (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855294#comment-13855294
 ] 

Andreas Lehmkühler commented on PDFBOX-1808:


I have some other changes ready to be committed, but I have some issues with my 
local repository. Those changes will definitely improve the memory footprint.


[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2013-12-16 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849213#comment-13849213
 ] 

Tilman Hausherr commented on PDFBOX-1808:


The nps file is the most interesting: look at the live columns only. Yes, 
you'll find many strings and chars. Most, I believe, are the strings from the 
text stripping. Now sort by name and look at org.apache.fontbox* and 
org.apache.pdfbox.*, again only at the live columns. At the few places where 
the number was != 0, I was able to find static declarations, e.g. maps that keep 
objects to speed things up. What I also found (and this is related to what you 
found) is that there are static maps in COSName and in PDFont. 
IF MY THEORY is correct, then you might want to try calling 
COSName.clearResources() and PDFont.clearResources() after stripping to see if it 
gets better. Of course, your software will be slower.

(The sad news, for me, is that the NB profiler isn't telling the whole story: 
it claims that createFont is calling PDFont.<clinit> but doesn't show the calls 
in between.)
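
To illustrate the theory about static maps, here is a minimal, self-contained sketch of a static cache that keeps its entries reachable until it is cleared explicitly. It is a toy model only: StaticCacheDemo, lookup and cachedEntries are hypothetical names, and its clearResources() merely borrows the name of the PDFBox methods mentioned above without reproducing their implementation.

```java
import java.util.HashMap;
import java.util.Map;

public class StaticCacheDemo {
    // Static map: lives as long as the class is loaded, so everything it
    // references stays reachable and cannot be garbage collected.
    private static final Map<String, char[]> CACHE = new HashMap<String, char[]>();

    /** Returns the cached value for the key, creating and caching it on first use. */
    static char[] lookup(String key) {
        char[] value = CACHE.get(key);
        if (value == null) {
            value = key.toCharArray();
            CACHE.put(key, value);
        }
        return value;
    }

    /** Number of entries currently held by the static cache. */
    static int cachedEntries() {
        return CACHE.size();
    }

    /** Drops all cached entries so they become eligible for collection. */
    static void clearResources() {
        CACHE.clear();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            lookup("name-" + i);
        }
        System.out.println("cached after use: " + cachedEntries());

        clearResources();
        System.out.println("cached after clear: " + cachedEntries());
    }
}
```

The trade-off is the one noted above: after clearing, every later lookup has to rebuild its entry, so the program gets slower in exchange for a smaller set of live objects.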


[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2013-12-16 Thread Guyenot Jeremy (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849257#comment-13849257
 ] 

Guyenot Jeremy commented on PDFBOX-1808:


I tried this:
PDDocument cd = PDDocument.load(file);
PDFTextStripper stripper = new PDFTextStripper();
retour = stripper.getText(cd);
COSName.clearResources();
PDFont.clearResources();

But this is the result:
PDDocument.load - Total memory (Mo): 747.0
PDFTextStripper.getText - Total memory (Mo): 747.0
COSName.clearResources - Total memory (Mo): 747.0
PDFont.clearResources - Total memory (Mo): 747.0


[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2013-12-16 Thread Guyenot Jeremy (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849340#comment-13849340
 ] 

Guyenot Jeremy commented on PDFBOX-1808:


When a file produces this message in the NetBeans output window:
déc. 16, 2013 5:39:54 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 1550 is wrong. Fall back to reading stream until 'endstream'.

the memory increases. Do you know why?


Logs:
-- START - Total memory (Mo): 95.0
-- File : D:\Armoires\DEVEARM\mphh\ocr\2\1450\2 - SITUATION\AUTRES ELEMENTS DE SITUATION\Reprise adulte_001.pdf
déc. 16, 2013 5:39:54 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 1550 is wrong. Fall back to reading stream until 'endstream'.
- PDFParser.getPDDocument - Total memory (Mo): 95.0
- PDFTextStripper.getText - Total memory (Mo): 121.0
- ALL closes - Total memory (Mo): 121.0


[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2013-12-16 Thread Maruan Sahyoun (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849344#comment-13849344
 ] 

Maruan Sahyoun commented on PDFBOX-1808:


Hi,

Could you try using PDDocument.loadNonSeq instead of PDDocument.load? 
loadNonSeq parses PDFs following the xref entries (which is in line with the PDF 
spec), whereas load parses sequentially, which can lead to errors such as the 
last one you reported.

BR
Maruan


[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2013-12-16 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849353#comment-13849353
 ] 

Tilman Hausherr commented on PDFBOX-1808:


Regarding your comment after trying clearResources(): you didn't clean up the 
stripper itself and all the rest, which you did in your test program. Also, you 
should look at it with the profiler, as you did last time.


[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2013-12-16 Thread Guyenot Jeremy (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849737#comment-13849737
 ] 

Guyenot Jeremy commented on PDFBOX-1808:


I tried PDDocument.loadNonSeq with a RandomAccessFile as the second 
parameter.
No change in memory.
When I load the file DOSSIER DE CANDIDATURE_001.pdf, the memory goes up from 
120 MB to 750 MB and stays at that size.

 PDFTextStripper.getText - hight memory usage
 

 Key: PDFBOX-1808
 URL: https://issues.apache.org/jira/browse/PDFBOX-1808
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.2, 1.8.3
 Environment: Windows 7
 Java jdk 1.7.0_45
Reporter: Guyenot Jeremy
Priority: Critical
  Labels: performance
 Attachments: 1808-java char copyof.jpg, 1808-java char 
 copyofrange.jpg, 1808-java usage.jpg, 1808-pdfbox usage.jpg, 
 1808-snapshot.nps, DOSSIER DE CANDIDATURE_001.pdf, s5-1.png, s5-2.png, 
 s50-1.png, s50-2.png

   Original Estimate: 72h
  Remaining Estimate: 72h

 Hello,
 i'm trying to extract text from pdfs but i can find that the PDFTextStripper 
 use a lot of memory.
 With a pdf that have 2676 pages (for a 4.6Mo size) it use 1.5Go memory.
 I also constat that the memory is'nt free after the getText method is called.
 You can see my code bellow:
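 // Imports needed by this fragment (PDFBox 1.8.x and Lucene)
 import org.apache.lucene.document.Field;
 import org.apache.lucene.document.TextField;
 import org.apache.pdfbox.pdmodel.PDDocument;
 import org.apache.pdfbox.util.PDFTextStripper;

 // file, fieldName and retour are declared elsewhere in the surrounding class.
 // Bytes are divided by 1000000 to report Mo (MB), which matches the output below.
 double virgule = Math.pow(10, 2); // rounding factor: two decimal places
 System.out.println("START - Total memory (Mo): " 
     + Math.round((Runtime.getRuntime().totalMemory() / 1000000) * virgule) / virgule);
 PDDocument cd = PDDocument.load(file);
 System.out.println("PDDocument getNumberOfPages - Nombre de pages: " 
     + cd.getNumberOfPages());
 System.out.println("PDDocument load - Total memory (Mo): " 
     + Math.round((Runtime.getRuntime().totalMemory() / 1000000) * virgule) / virgule);
 String pdfText = "";
 try {
     PDFTextStripper stripper = new PDFTextStripper();
     pdfText = stripper.getText(cd);
     System.out.println("PDFTextStripper getText - Total memory (Mo): " 
         + Math.round((Runtime.getRuntime().totalMemory() / 1000000) * virgule) / virgule);
     // Drop the stripper's internal state and the reference to it
     stripper.resetEngine();
     stripper = null;
     System.out.println("PDFTextStripper resetEngine - Total memory (Mo): " 
         + Math.round((Runtime.getRuntime().totalMemory() / 1000000) * virgule) / virgule);
 } finally {
     // Always close the document, even if extraction fails
     if (cd != null) {
         cd.close();
         cd = null;
         System.out.println("PDDocument close - Total memory (Mo): " 
             + Math.round((Runtime.getRuntime().totalMemory() / 1000000) * virgule) / virgule);
     }
 }
 retour = new TextField(fieldName, pdfText, Field.Store.NO);
 System.out.println("TextField - Total memory (Mo): " 
     + Math.round((Runtime.getRuntime().totalMemory() / 1000000) * virgule) / virgule);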
 And the result in my output window:
 START - Total memory (Mo): 95.0
 PDDocument getNumberOfPages - Nombre de pages: 2676
 PDDocument load - Total memory (Mo): 121.0
 PDFTextStripper getText - Total memory (Mo): 757.0
 PDFTextStripper resetEngine - Total memory (Mo): 757.0
 PDDocument close - Total memory (Mo): 757.0
 TextField - Total memory (Mo): 757.0
 pdfText - Total memory (Mo): 757.0
 I also tried calling System.gc(), but the memory use stayed the same.
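
A note on the figures above: Runtime.totalMemory() reports the heap the JVM currently has reserved, not the part occupied by objects, and a JVM does not necessarily hand reserved heap back after a collection. A minimal sketch, assuming a hypothetical helper class HeapReport (not part of PDFBox) and a 50 MB stand-in allocation instead of a real PDF, that prints used heap next to reserved heap:

```java
import java.util.Locale;

public class HeapReport {
    private static final double BYTES_PER_MB = 1_000_000.0;

    /** Heap currently reserved by the JVM, in MB (what totalMemory() reports). */
    static double reservedMb() {
        return Runtime.getRuntime().totalMemory() / BYTES_PER_MB;
    }

    /** Heap occupied by objects, collected or not, in MB. */
    static double usedMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / BYTES_PER_MB;
    }

    /** One log line in the style of the output above, with both figures. */
    static String line(String label) {
        return String.format(Locale.ROOT, "%s - used: %.1f Mo, reserved: %.1f Mo",
                label, usedMb(), reservedMb());
    }

    public static void main(String[] args) {
        System.out.println(line("START"));
        byte[] block = new byte[50_000_000]; // stand-in for a large extraction
        block[0] = 1;
        System.out.println(line("after allocation"));
        block = null;
        System.gc(); // only a request; the JVM may ignore it
        System.out.println(line("after gc"));
    }
}
```

If the used figure drops after the collection while the reserved figure stays high, the objects were released and only the heap reservation remains.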




[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2013-12-14 Thread Guyenot Jeremy (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13848342#comment-13848342
 ] 

Guyenot Jeremy commented on PDFBOX-1808:


Hello,

After more tests I found some cases where memory seems to leak:
1) After extracting text from certain PDFs, the memory is not freed:
START - Total memory (Mo): 468.0
-- File : D:\Armoires\DEVEARM\mphh\image\167\83545\DOSSIER DE 
CANDIDATURE_001.pdf
-- File size (ko): 4975.0
- PDDocument.load - Total memory (Mo): 468.0
- PDDocument.getNumberOfPages : 2676
- PDFTextStripper.getText - Total memory (Mo): 747.0
START - Total memory (Mo): 745.0
-- File : D:\Armoires\DEVEARM\mphh\image\167\83545\4 - EVALUATION ET 
BILANS\BILAN SOCIAL\Reprise adulte_001.pdf
-- File size (ko): 79.0
-- File size (Mo): 0.0
- PDDocument.load - Total memory (Mo): 745.0
- PDDocument.getNumberOfPages : 2
- PDFTextStripper.getText - Total memory (Mo): 745.0

2) With certain other files I get this:
START - Total memory (Mo): 268.0
-- File : D:\Armoires\DEVEARM\mphh\ocr\188\94458\4 - EVALUATION ET 
BILANS\ADMISSION URGENCE\Reprise adulte_003.pdf
-- File size (ko): 1183.0
-- File size (Mo): 1.0
déc. 14, 2013 12:55:43 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 2110 is wrong. Fall back to reading stream 
until 'endstream'.
déc. 14, 2013 12:55:43 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 1286 is wrong. Fall back to reading stream 
until 'endstream'.
déc. 14, 2013 12:55:43 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 706 is wrong. Fall back to reading stream 
until 'endstream'.
déc. 14, 2013 12:55:43 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 420 is wrong. Fall back to reading stream 
until 'endstream'.
déc. 14, 2013 12:55:43 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 936 is wrong. Fall back to reading stream 
until 'endstream'.
- PDDocument.load - Total memory (Mo): 268.0
- PDDocument.getNumberOfPages : 41
- PDFTextStripper.getText - Total memory (Mo): 469.0
START - Total memory (Mo): 469.0
-- File : D:\Armoires\DEVEARM\mphh\image\167\83545\0 - 
INSTRUCTION\RECEVABILITE\AR Complet_001.pdf
-- File size (ko): 115.0
-- File size (Mo): 0.0
- PDDocument.load - Total memory (Mo): 469.0
- PDDocument.getNumberOfPages : 3
- PDFTextStripper.getText - Total memory (Mo): 469.0

You can see that the memory is not freed after use.
I can't give you my PDF files because they contain personal 
information.


[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

2013-12-13 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13847891#comment-13847891
 ] 

Tilman Hausherr commented on PDFBOX-1808:
-

What happens if you extract from several PDFs in your software, or from the 
same PDF several times? Does memory use keep growing, or does it stay the same?

I'm asking this to clarify whether 1) PDFBox is just using a lot of memory or 
2) PDFBox has memory leaks.

If you are using NetBeans, its profiler has some useful features. It helped me 
find a bug in PDFBOX-1694.
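
The repeated-run check asked about above could be sketched roughly as follows. RepeatRunCheck and its methods are hypothetical names, and the workload is a stand-in; in practice the Runnable would load a PDF, call getText, and close the document.

```java
import java.util.ArrayList;
import java.util.List;

public class RepeatRunCheck {
    /** Used heap in bytes after requesting a collection (System.gc() is only a hint). */
    static long usedAfterGc() {
        Runtime rt = Runtime.getRuntime();
        System.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Runs the task the given number of times and samples used heap after each run. */
    static List<Long> measure(Runnable task, int runs) {
        List<Long> samples = new ArrayList<>();
        for (int i = 0; i < runs; i++) {
            task.run();
            samples.add(usedAfterGc());
        }
        return samples;
    }

    public static void main(String[] args) {
        // Stand-in workload that allocates temporary text and then lets it go.
        Runnable task = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 200_000; i++) {
                sb.append(i);
            }
            if (sb.length() == 0) {
                throw new IllegalStateException("workload produced no text");
            }
        };
        List<Long> samples = measure(task, 5);
        for (int i = 0; i < samples.size(); i++) {
            System.out.println("run " + (i + 1) + " - used heap (bytes): " + samples.get(i));
        }
    }
}
```

A series that stays roughly flat would point to case 1 (high but bounded use), while a series that climbs with every run would point to case 2 (a leak). A profiler is still the more reliable tool for confirming either.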

