[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13893517#comment-13893517 ]

Guyenot Jeremy commented on PDFBOX-1808:

I get the same result when I check with the Windows Task Manager. The memory is released only after a long while. I tested with 1.8.4-SNAPSHOT.

PDFTextStripper.getText - hight memory usage

Key: PDFBOX-1808
URL: https://issues.apache.org/jira/browse/PDFBOX-1808
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.8.2, 1.8.3
Environment: Windows 7, Java jdk 1.7.0_45
Reporter: Guyenot Jeremy
Assignee: Andreas Lehmkühler
Priority: Critical
Labels: performance
Fix For: 1.8.4, 2.0.0
Attachments: 1808-java char copyof.jpg, 1808-java char copyofrange.jpg, 1808-java usage.jpg, 1808-pdfbox usage.jpg, 1808-snapshot.nps, DOSSIER DE CANDIDATURE_001.pdf, Screenshot2014-01-21-19-51-24.png, netbeans_project.jpg, s5-1.png, s5-2.png, s50-1.png, s50-2.png
Original Estimate: 72h
Remaining Estimate: 72h

Hello, I'm trying to extract text from PDFs, but I find that PDFTextStripper uses a lot of memory. With a PDF of 2676 pages (4.6 MB in size), it uses 1.5 GB of memory. I also noticed that the memory isn't freed after the getText method is called.
You can see my code below:

// Fragment: file, fieldName and retour are declared elsewhere in the project.
double virgule = Math.pow(10, 2); // rounding factor for two decimal places
// Bytes to Mo (megabytes); the divisor 1024 * 1024 is assumed, chosen to match the Mo figures in the output below.
System.out.println("START - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory() / (1024 * 1024)) * virgule) / virgule);
PDDocument cd = PDDocument.load(file);
System.out.println("PDDocument getNumberOfPages - Nombre de pages: " + cd.getNumberOfPages());
System.out.println("PDDocument load - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory() / (1024 * 1024)) * virgule) / virgule);
String pdfText = "";
try {
    PDFTextStripper stripper = new PDFTextStripper();
    pdfText = stripper.getText(cd);
    System.out.println("PDFTextStripper getText - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory() / (1024 * 1024)) * virgule) / virgule);
    stripper.resetEngine();
    stripper = null;
    System.out.println("PDFTextStripper resetEngine - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory() / (1024 * 1024)) * virgule) / virgule);
} finally {
    if (cd != null) {
        cd.close();
        cd = null;
        System.out.println("PDDocument close - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory() / (1024 * 1024)) * virgule) / virgule);
    }
}
retour = new TextField(fieldName, pdfText, Field.Store.NO);
System.out.println("TextField - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory() / (1024 * 1024)) * virgule) / virgule);

And the result in my output window:

START - Total memory (Mo): 95.0
PDDocument getNumberOfPages - Nombre de pages: 2676
PDDocument load - Total memory (Mo): 121.0
PDFTextStripper getText - Total memory (Mo): 757.0
PDFTextStripper resetEngine - Total memory (Mo): 757.0
PDDocument close - Total memory (Mo): 757.0
TextField - Total memory (Mo): 757.0
pdfText - Total memory (Mo): 757.0

I also tried calling System.gc(), but the memory usage stayed the same.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13893672#comment-13893672 ]

John Hewson commented on PDFBOX-1808:

[~jguyenot], perhaps the JVM settings in http://stackoverflow.com/a/4625142 will help you.
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13878461#comment-13878461 ]

Timo Boehme commented on PDFBOX-1808:

[~jguyenot] please read up on the meaning of the memory statistics provided by Java. *Total memory* is (as the name says) all the memory the VM currently holds. What you want is the memory actually used by your application, which has to be calculated as totalMemory() - freeMemory() (see e.g. http://stackoverflow.com/questions/3571203/what-is-the-exact-meaning-of-runtime-getruntime-totalmemory-and-freememory).
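The calculation described in the comment above (used memory = total memory minus free memory) can be sketched as follows. This is only a minimal illustration, not code from the PDFBox project: the class name MemoryStats and its helper methods are hypothetical, and the megabyte conversion assumes 1 MB = 1024 * 1024 bytes. Only Runtime.totalMemory(), Runtime.freeMemory() and Runtime.maxMemory() are standard JDK calls.

```java
// Hypothetical helper (not part of PDFBox) illustrating used = total - free.
final class MemoryStats {
    // Assumption: 1 MB = 1024 * 1024 bytes.
    private static final double BYTES_PER_MB = 1024.0 * 1024.0;

    private MemoryStats() {
    }

    /** Heap currently occupied by objects: memory held by the VM minus its free part. */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Converts a byte count to megabytes, rounded to two decimal places. */
    static double toMegabytes(long bytes) {
        return Math.round(bytes / BYTES_PER_MB * 100.0) / 100.0;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // totalMemory() is what the reporter's code printed; it never shrinks
        // just because objects were garbage collected.
        System.out.println("total (MB): " + toMegabytes(rt.totalMemory()));
        System.out.println("free  (MB): " + toMegabytes(rt.freeMemory()));
        System.out.println("used  (MB): " + toMegabytes(usedBytes()));
        System.out.println("max   (MB): " + toMegabytes(rt.maxMemory()));
    }
}
```

Printing the used value at each step of the reporter's test, instead of the total value, would show how much heap the extracted text and the document objects actually occupy at that point.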
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13878466#comment-13878466 ]

Timo Boehme commented on PDFBOX-1808:

One addition to my last comment: whether the JVM releases memory back to the operating system when a large part of the heap is free depends on the JVM implementation. Server VMs typically keep the allocated memory, regardless of whether the Java application still needs it.
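The difference between memory held by the VM and memory used by the application can be observed with a small experiment. The sketch below is an assumption-laden illustration rather than anything from the issue: the class HeapRetentionDemo, its format helper and the 32 MB allocation size are all made up for the example, and System.gc() is only a hint, so the exact figures vary between JVMs and runs.

```java
// Hypothetical demo (not part of PDFBox): allocate, drop the references,
// request a collection, and compare "total" with "used" at each step.
final class HeapRetentionDemo {
    private HeapRetentionDemo() {
    }

    /** Formats one line of statistics in whole megabytes (1 MB = 1024 * 1024 bytes). */
    static String format(String label, long totalBytes, long freeBytes) {
        return String.format("%s: total=%d MB, used=%d MB",
                label, totalBytes >> 20, (totalBytes - freeBytes) >> 20);
    }

    /** Reads the current statistics from the running VM. */
    static String snapshot(String label) {
        Runtime rt = Runtime.getRuntime();
        return format(label, rt.totalMemory(), rt.freeMemory());
    }

    public static void main(String[] args) {
        System.out.println(snapshot("start"));

        // Hold 32 blocks of 1 MB each so the heap has to grow or fill up.
        byte[][] blocks = new byte[32][];
        for (int i = 0; i < blocks.length; i++) {
            blocks[i] = new byte[1024 * 1024];
        }
        System.out.println(snapshot("after allocation"));

        // Drop the only reference and ask for a collection (a hint, not a guarantee).
        blocks = null;
        System.gc();
        System.out.println(snapshot("after gc"));
    }
}
```

On many JVMs the used figure drops after the collection while the total figure stays where it was, which is the behaviour described in the comment above and the reason the Windows Task Manager keeps showing a large process size.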
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876343#comment-13876343 ]

Guyenot Jeremy commented on PDFBOX-1808:

Hello, my NetBeans project is set up as shown in netbeans_project.jpg. How can I integrate the sources from http://svn.apache.org/repos/asf/pdfbox/trunk/ in place of pdfbox-app-1.8.3.jar?
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876401#comment-13876401 ]

Tilman Hausherr commented on PDFBOX-1808:

With NetBeans:
- choose Team, Subversion, Checkout
- enter http://svn.apache.org/repos/asf/pdfbox/trunk
- click Next
- replace pdfbox/trunk with pdfbox/branches/1.8
- click Finish
- ...wait some time...
- NetBeans will ask you to create a project; do so.
- At the next question, select "pdfbox reactor" only. It will create a project "pdfbox reactor".
- Then do a build on that one.

The jar files you need will be in the xxx\target directories, or all together in 1.8\war\target\pdfbox-war-2.0.0-SNAPSHOT\WEB-INF\lib

If you can't do it at work because of a firewall, do it from home.
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876438#comment-13876438 ]

Guyenot Jeremy commented on PDFBOX-1808:

Thanks, Tilman Hausherr. I created a project and compiled it. I added the new jar files to my library folder and tested. This is the result:

# File : DOSSIER DE CANDIDATURE_001.pdf
# START - Total memory (Mo): 167.0
# PDDocument getNumberOfPages - Nombre de pages: 2676
# PDDocument load - Total memory (Mo): 167.0
# PDFTextStripper getText - Total memory (Mo): 706.0
# PDDocument close - Total memory (Mo): 706.0

You can see that the memory is not released after processing.
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873111#comment-13873111 ]

Andreas Lehmkühler commented on PDFBOX-1808:

I added most of the changes (excluding 1553174, as it introduces an API incompatibility) to the 1.8 branch in revision 1558705.

[~jguyenot] Any luck with your test? Do you need any additional help or information?
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870503#comment-13870503 ]

Guyenot Jeremy commented on PDFBOX-1808:

Hello, thank you for your response. I am trying to integrate your changes into my project, but I use the .jar. Is it possible to get a jar build that contains all the changes, so that I can add it to my NetBeans project?
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870556#comment-13870556 ]

Tilman Hausherr commented on PDFBOX-1808:

Can you access the svn repository from where you are now, or is your firewall preventing it? The URL is http://svn.apache.org/repos/asf/pdfbox/trunk/ (see http://pdfbox.apache.org/downloads.html#scm ).
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13870566#comment-13870566 ]

Andreas Lehmkühler commented on PDFBOX-1808:

[~jguyenot] There is no official release containing those changes, and we don't have any plans to release one yet. If you are using Maven, you can reference the SNAPSHOT versions of PDFBox (1.8.4-SNAPSHOT or 2.0.0-SNAPSHOT), or simply download the latest SNAPSHOT build from nexus (https://repository.apache.org/index.html). Be aware that these are unstable builds.
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13868761#comment-13868761 ]

Andreas Lehmkühler commented on PDFBOX-1808:

I added a null check to PDDocument#close in revision 1557374 to avoid an NPE when splitting a PDF.
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13862257#comment-13862257 ]

Tilman Hausherr commented on PDFBOX-1808:

Let's just wait until Jeremy is back from vacation. My own tests show a higher usage (?!), but I don't trust Java to tell the truth, and there have been other changes since then. I didn't save my own memory .nps snapshot, so I can't compare. Your code changes looked like they should have improved things.

START - Total memory (Mo): 128.0
strip size: 2595883
PDDocument close - Total memory (Mo): 926.0
strip size: 2595883
PDDocument close - Total memory (Mo): 1026.0
strip size: 2595883
PDDocument close - Total memory (Mo): 1027.0
strip size: 2595883
PDDocument close - Total memory (Mo): 1021.0
strip size: 2595883
PDDocument close - Total memory (Mo): 1171.0
After sleep - Total memory (Mo): 1174.0
After sleep - Total memory (Mo): 1174.0
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13862278#comment-13862278 ]

Andreas Lehmkühler commented on PDFBOX-1808:

I'm using Java VisualVM (it's part of the JDK) as a profiler. It has a lot of monitoring features; for example, one can see all living objects, which makes it easy to check whether they can be finalized or not. In my environment all objects were released, at least after triggering the GC.
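A note on the numbers being compared in this thread: Runtime.totalMemory() reports the heap the JVM has currently reserved from the operating system, not the memory occupied by objects, so it can stay high after objects have been released. The following is only a minimal sketch of measuring used heap instead (total minus free). The class name HeapUsage and its helper methods are hypothetical and not part of PDFBox, and System.gc() is merely a hint to the JVM, so the reported values will vary by JVM and settings.

```java
public class HeapUsage {
    private static final double BYTES_PER_MB = 1024.0 * 1024.0;

    /** Heap bytes currently occupied by objects, live or not yet collected. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        // totalMemory() is the heap reserved by the JVM; freeMemory() is the unused part of it.
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Converts a byte count to megabytes, rounded to two decimals. */
    static double toMegabytes(long bytes) {
        return Math.round(bytes / BYTES_PER_MB * 100.0) / 100.0;
    }

    /** Prints used and reserved heap side by side so the two are not confused. */
    static void report(String label) {
        Runtime rt = Runtime.getRuntime();
        System.out.println(label
                + " - used: " + toMegabytes(usedHeapBytes()) + " MB"
                + ", reserved: " + toMegabytes(rt.totalMemory()) + " MB");
    }

    public static void main(String[] args) {
        report("START");

        // Stand-in for a memory-hungry operation such as text extraction.
        byte[][] blocks = new byte[64][];
        for (int i = 0; i < blocks.length; i++) {
            blocks[i] = new byte[1024 * 1024];
        }
        report("After allocating 64 blocks of 1 MB");

        // Drop the only reference, then ask (not force) the JVM to collect.
        blocks = null;
        System.gc();
        report("After releasing and requesting GC");
    }
}
```

If the "used" figure drops after the GC request while "reserved" stays flat, the objects were released and the JVM is simply holding on to heap it has already claimed, which is also what a profiler such as VisualVM would show.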
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13861759#comment-13861759 ]

Andreas Lehmkühler commented on PDFBOX-1808:

Can anybody confirm that my changes are working well? I'd like to add them to the 1.8 branch as well.
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13855838#comment-13855838 ]

Andreas Lehmkühler commented on PDFBOX-1808:

In revision 1553174 I removed some cached values that are only used once. In revision 1553175 I added some code to release used resources when they are no longer needed.
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13855891#comment-13855891 ]

Guyenot Jeremy commented on PDFBOX-1808:

Hello, thank you for your work on this problem. I am currently on vacation, but when I return to work I will test your changes and get back to you. Have a great holiday.
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13855917#comment-13855917 ]

Tilman Hausherr commented on PDFBOX-1808:

@Andreas: it no longer builds. I believe the problem is in testCreateEmptyPdf(): pdfDoc.close() is called too early.
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13855947#comment-13855947 ]

Andreas Lehmkühler commented on PDFBOX-1808:

Yes, I missed that, sorry, my fault. I'll have a look.
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13855952#comment-13855952 ]

Andreas Lehmkühler commented on PDFBOX-1808:

I fixed the test in revision 1553220. Thanks for the pointer!
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13855238#comment-13855238 ]

Andreas Lehmkühler commented on PDFBOX-1808:

PDFBOX-1777 should already release some of the resources at the end of parsing.
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855253#comment-13855253 ]

Tilman Hausherr commented on PDFBOX-1808:

I repeated the test mentioned in 14/Dec/13 22:29; the output is slightly better:

    START - Total memory (Mo): 128.0
    strip size: 2595883
    PDDocument close - Total memory (Mo): 903.0
    strip size: 2595883
    PDDocument close - Total memory (Mo): 914.0
    strip size: 2595883
    PDDocument close - Total memory (Mo): 1020.0
    strip size: 2595883
    PDDocument close - Total memory (Mo): 1070.0
    strip size: 2595883
    PDDocument close - Total memory (Mo): 1074.0
    After sleep - Total memory (Mo): 1076.0
    After sleep - Total memory (Mo): 1076.0
    After sleep - Total memory (Mo): 1076.0
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855294#comment-13855294 ]

Andreas Lehmkühler commented on PDFBOX-1808:

I have some other changes ready to be committed, but I have some issues with my local repository. Those changes will definitely improve the memory footprint.
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849213#comment-13849213 ]

Tilman Hausherr commented on PDFBOX-1808:

The nps file is the most interesting: look at the live columns only. Yes, you'll find many strings and chars. Most, I believe, are the strings from the text stripping.

Now sort by name and look at org.apache.fontbox.* and org.apache.pdfbox.*, again only at the live columns. At the few places where the number was != 0, I was able to find static declarations, e.g. maps that keep objects to speed things up.

What I also found (and it is related to what you found) is that there are static maps in COSName and in PDFont. IF MY THEORY is correct, then you might want to try calling COSName.clearResources() and PDFont.clearResources() after strips to see if it gets better. Of course your software will be slower.

(The sad news, for me, is that the NB profiler isn't telling the whole story: it claims that createFont is calling PDFont.<clinit> but doesn't show the calls in between.)
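The theory in Tilman's comment is that static maps keep objects reachable after the document is closed, and that a clearResources() call releases them. The sketch below shows that general pattern with a made-up class; NameCache, get and size are hypothetical names for illustration only, and this is not the code of COSName or PDFont.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a class that interns its instances in a static map. */
final class NameCache {
    // Static, so every entry stays reachable for the life of the class,
    // regardless of which document caused it to be created.
    private static final Map<String, NameCache> CACHE = new HashMap<String, NameCache>();

    private final String name;

    private NameCache(String name) {
        this.name = name;
    }

    /** Returns the cached instance for this name, creating it on first use. */
    static synchronized NameCache get(String name) {
        NameCache cached = CACHE.get(name);
        if (cached == null) {
            cached = new NameCache(name);
            CACHE.put(name, cached);
        }
        return cached;
    }

    static synchronized int size() {
        return CACHE.size();
    }

    /** Drops all cached instances so the garbage collector can reclaim them. */
    static synchronized void clearResources() {
        CACHE.clear();
    }

    String getName() {
        return name;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100000; i++) {
            get("F" + i); // e.g. one entry per distinct name seen while parsing
        }
        System.out.println("cached entries: " + size());
        clearResources();
        System.out.println("cached entries after clear: " + size());
    }
}
```

The trade-off is the one mentioned in the comment: after clearing, the next document has to rebuild every entry, so repeated extraction gets slower.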
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849257#comment-13849257 ]

Guyenot Jeremy commented on PDFBOX-1808:

I tried this:

    PDDocument cd = PDDocument.load(file);
    PDFTextStripper stripper = new PDFTextStripper();
    retour = stripper.getText(cd);
    COSName.clearResources();
    PDFont.clearResources();

But this is the result:

    PDDocument.load - Total memory (Mo): 747.0
    PDFTextStripper.getText - Total memory (Mo): 747.0
    COSName.clearResources - Total memory (Mo): 747.0
    PDFont.clearResources - Total memory (Mo): 747.0
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849340#comment-13849340 ]

Guyenot Jeremy commented on PDFBOX-1808:

When a file produces this message in the NetBeans output window:

    déc. 16, 2013 5:39:54 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
    WARNING: Specified stream length 1550 is wrong. Fall back to reading stream until 'endstream'.

the memory increases. Do you know why?

Logs:

    -- START - Total memory (Mo): 95.0
    -- File : D:\Armoires\DEVEARM\mphh\ocr\2\1450\2 - SITUATION\AUTRES ELEMENTS DE SITUATION\Reprise adulte_001.pdf
    déc. 16, 2013 5:39:54 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
    WARNING: Specified stream length 1550 is wrong. Fall back to reading stream until 'endstream'.
    - PDFParser.getPDDocument - Total memory (Mo): 95.0
    - PDFTextStripper.getText - Total memory (Mo): 121.0
    - ALL closes - Total memory (Mo): 121.0
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849344#comment-13849344 ]

Maruan Sahyoun commented on PDFBOX-1808:

Hi,

could you try using PDDocument.loadNonSeq instead of PDDocument.load? loadNonSeq parses PDFs by following the xref entries (which is in line with the PDF spec), whereas load parses sequentially, which can lead to errors such as the last one you reported.

BR
Maruan
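The warning discussed in this exchange says the parser falls back to reading until 'endstream' when the declared stream length is wrong. The sketch below illustrates that idea only; it is not PDFBox's implementation. It models the file as a String for brevity (a real parser works on bytes), and the names StreamLengthSketch, readStream and keywordAt are invented for this example.

```java
final class StreamLengthSketch {
    private static final String END = "endstream";

    /**
     * Returns the stream data beginning at dataStart. The declared length is
     * trusted only if the "endstream" keyword follows it (optionally after an
     * end-of-line marker); otherwise the text is scanned for the keyword.
     */
    static String readStream(String pdf, int dataStart, int declaredLength) {
        int declaredEnd = dataStart + declaredLength;
        if (declaredLength >= 0 && declaredEnd <= pdf.length() && keywordAt(pdf, declaredEnd)) {
            return pdf.substring(dataStart, declaredEnd);
        }
        // Declared length is wrong: fall back to searching for the keyword.
        int found = pdf.indexOf(END, dataStart);
        if (found < 0) {
            throw new IllegalArgumentException("no 'endstream' keyword found");
        }
        int end = found;
        // Drop the end-of-line marker that precedes the keyword, if any.
        if (end > dataStart && pdf.charAt(end - 1) == '\n') {
            end--;
        }
        if (end > dataStart && pdf.charAt(end - 1) == '\r') {
            end--;
        }
        return pdf.substring(dataStart, end);
    }

    /** True if "endstream" starts at pos, allowing one CR, LF or CRLF before it. */
    private static boolean keywordAt(String pdf, int pos) {
        int p = pos;
        if (p < pdf.length() && pdf.charAt(p) == '\r') {
            p++;
        }
        if (p < pdf.length() && pdf.charAt(p) == '\n') {
            p++;
        }
        return pdf.startsWith(END, p);
    }

    public static void main(String[] args) {
        String pdf = "stream\nHello World\nendstream\nendobj";
        int dataStart = "stream\n".length();
        System.out.println(readStream(pdf, dataStart, 11));   // correct length
        System.out.println(readStream(pdf, dataStart, 1550)); // wrong length, uses the fallback
    }
}
```

A scan like this can be fooled when the stream data itself contains the keyword, which is one reason a parser that follows the xref table and correct lengths is preferable where the file allows it.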
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849353#comment-13849353 ]

Tilman Hausherr commented on PDFBOX-1808:

Re your comment after having tried clearResources(): you didn't clean up the stripper itself and all the rest, which you did in your test program. Plus, you should look at it with the profiler like you did last time.
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849737#comment-13849737 ]

Guyenot Jeremy commented on PDFBOX-1808:

I tried PDDocument.loadNonSeq with a RandomAccessFile as the second parameter. No change in memory use. When I load the file DOSSIER DE CANDIDATURE_001.pdf, the memory goes up from 120 MB to 750 MB and stays at that size.
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848342#comment-13848342 ]

Guyenot Jeremy commented on PDFBOX-1808:

Hello, after more tests I found some cases where memory leaks:

1) After extracting text from certain PDFs, the memory is not freed:

    START - Total memory (Mo): 468.0
    -- File : D:\Armoires\DEVEARM\mphh\image\167\83545\DOSSIER DE CANDIDATURE_001.pdf
    -- File size (ko): 4975.0
    - PDDocument.load - Total memory (Mo): 468.0
    - PDDocument.getNumberOfPages : 2676
    - PDFTextStripper.getText - Total memory (Mo): 747.0
    START - Total memory (Mo): 745.0
    -- File : D:\Armoires\DEVEARM\mphh\image\167\83545\4 - EVALUATION ET BILANS\BILAN SOCIAL\Reprise adulte_001.pdf
    -- File size (ko): 79.0
    -- File size (Mo): 0.0
    - PDDocument.load - Total memory (Mo): 745.0
    - PDDocument.getNumberOfPages : 2
    - PDFTextStripper.getText - Total memory (Mo): 745.0

2) On certain others I get this:

    START - Total memory (Mo): 268.0
    -- File : D:\Armoires\DEVEARM\mphh\ocr\188\94458\4 - EVALUATION ET BILANS\ADMISSION URGENCE\Reprise adulte_003.pdf
    -- File size (ko): 1183.0
    -- File size (Mo): 1.0
    déc. 14, 2013 12:55:43 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
    WARNING: Specified stream length 2110 is wrong. Fall back to reading stream until 'endstream'.
    déc. 14, 2013 12:55:43 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
    WARNING: Specified stream length 1286 is wrong. Fall back to reading stream until 'endstream'.
    déc. 14, 2013 12:55:43 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
    WARNING: Specified stream length 706 is wrong. Fall back to reading stream until 'endstream'.
    déc. 14, 2013 12:55:43 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
    WARNING: Specified stream length 420 is wrong. Fall back to reading stream until 'endstream'.
    déc. 14, 2013 12:55:43 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
    WARNING: Specified stream length 936 is wrong. Fall back to reading stream until 'endstream'.
    - PDDocument.load - Total memory (Mo): 268.0
    - PDDocument.getNumberOfPages : 41
    - PDFTextStripper.getText - Total memory (Mo): 469.0
    START - Total memory (Mo): 469.0
    -- File : D:\Armoires\DEVEARM\mphh\image\167\83545\0 - INSTRUCTION\RECEVABILITE\AR Complet_001.pdf
    -- File size (ko): 115.0
    -- File size (Mo): 0.0
    - PDDocument.load - Total memory (Mo): 469.0
    - PDDocument.getNumberOfPages : 3
    - PDFTextStripper.getText - Total memory (Mo): 469.0

You can see that the memory is not freed after use. I can't give you my PDF files because they contain personal information.
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13847891#comment-13847891 ]

Tilman Hausherr commented on PDFBOX-1808:

What happens if you extract from several PDFs in the software, or from the same PDF several times? Is more and more memory used, or does it stay the same? I'm asking this to clarify whether 1) PDFBox is just using a lot of memory or 2) PDFBox has memory leaks. If you are using NetBeans, the profiler has some cool features. It helped me find a bug in PDFBOX-1694.

PDFTextStripper.getText - hight memory usage

Key: PDFBOX-1808
URL: https://issues.apache.org/jira/browse/PDFBOX-1808
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.8.2, 1.8.3
Environment: Windows 7, Java JDK 1.7.0_45
Reporter: Guyenot Jeremy
Priority: Critical
Labels: performance
Original Estimate: 72h
Remaining Estimate: 72h

Hello, I'm trying to extract text from PDFs, but I find that PDFTextStripper uses a lot of memory. With a PDF that has 2676 pages (4.6 MB in size), it uses 1.5 GB of memory. I also notice that the memory isn't freed after the getText method is called.
You can see my code below:

    double virgule = Math.pow(10, 2);
    System.out.println("START - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/100) * virgule) / virgule);
    PDDocument cd = PDDocument.load(file);
    System.out.println("PDDocument getNumberOfPages - Nombre de pages: " + cd.getNumberOfPages());
    System.out.println("PDDocument load - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/100) * virgule) / virgule);
    String pdfText = "";
    try {
        PDFTextStripper stripper = new PDFTextStripper();
        pdfText = stripper.getText(cd);
        System.out.println("PDFTextStripper getText - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/100) * virgule) / virgule);
        stripper.resetEngine();
        stripper = null;
        System.out.println("PDFTextStripper resetEngine - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/100) * virgule) / virgule);
    } finally {
        if (cd != null) {
            cd.close();
            cd = null;
            System.out.println("PDDocument close - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/100) * virgule) / virgule);
        }
    }
    retour = new TextField(fieldName, pdfText, Field.Store.NO);
    System.out.println("TextField - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/100) * virgule) / virgule);

And the result in my output window:

    START - Total memory (Mo): 95.0
    PDDocument getNumberOfPages - Nombre de pages: 2676
    PDDocument load - Total memory (Mo): 121.0
    PDFTextStripper getText - Total memory (Mo): 757.0
    PDFTextStripper resetEngine - Total memory (Mo): 757.0
    PDDocument close - Total memory (Mo): 757.0
    TextField - Total memory (Mo): 757.0
    pdfText - Total memory (Mo): 757.0

I also tried calling System.gc(), but the memory usage stays the same.

--
This message was sent by Atlassian JIRA (v6.1.4#6159)
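The question in the comment above (does memory keep growing over repeated extractions, or does it level off?) could be checked with a small harness like the following. This is only a sketch under stated assumptions: the names HeapProbe, usedHeapBytes and sampleAfterRuns are hypothetical, the workload in main is a stand-in rather than a PDFBox call, and it uses lambdas, so it needs Java 8 or later. It samples used heap (totalMemory() minus freeMemory()) instead of totalMemory() alone, because totalMemory() reports the heap the JVM currently has reserved, which usually does not shrink right after a collection.

```java
import java.util.function.Supplier;

/**
 * Sketch: sample the used heap after repeated runs of a task.
 * All names here are hypothetical and not part of PDFBox.
 */
final class HeapProbe {

    /** Used heap = heap currently reserved by the JVM minus its free part. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /**
     * Runs the task the given number of times and records the used heap
     * after each run. System.gc() is only a hint to the JVM, so the
     * samples are approximate.
     */
    static long[] sampleAfterRuns(int runs, Supplier<?> task) {
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            task.get();      // result is discarded, so it becomes unreachable
            System.gc();     // request a collection before sampling
            samples[i] = usedHeapBytes();
        }
        return samples;
    }

    public static void main(String[] args) {
        // Stand-in workload (assumption): builds and returns a large string.
        // In the reporter's setup this would instead load the PDF, call
        // PDFTextStripper.getText, and close the document.
        Supplier<String> task = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 1_000_000; i++) {
                sb.append('x');
            }
            return sb.toString();
        };

        long[] samples = sampleAfterRuns(5, task);
        for (int i = 0; i < samples.length; i++) {
            System.out.printf("run %d - used heap (MB): %.1f%n",
                    i + 1, samples[i] / (1024.0 * 1024.0));
        }
    }
}
```

If the samples stay roughly flat from run to run, that points toward high but bounded memory use; if they climb steadily, a leak becomes more plausible and a profiler snapshot would be the next step.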