[ https://issues.apache.org/jira/browse/PDFBOX-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756825#action_12756825 ]
Navendu Garg commented on PDFBOX-533:
-------------------------------------

I just extracted text from a 50 MB file using the old version, and it took approximately 20 seconds to convert to text. Please note that I was doing other things at the same time, so I admit this is a crude benchmark. Unfortunately, the PDFBox 0.8.0-incubating version crashed on this file with the following error:

org.apache.pdfbox.exceptions.WrappedIOException
	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:237)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:841)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:808)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:779)
	at org.apache.pdfbox.util.TestLargePDF.test(TestLargePDF.java:13)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:592)
	at junit.framework.TestCase.runTest(TestCase.java:164)
	at junit.framework.TestCase.runBare(TestCase.java:130)
	at junit.framework.TestResult$1.protect(TestResult.java:106)
	at junit.framework.TestResult.runProtected(TestResult.java:124)
	at junit.framework.TestResult.run(TestResult.java:109)
	at junit.framework.TestCase.run(TestCase.java:120)
	at junit.framework.TestSuite.runTest(TestSuite.java:230)
	at junit.framework.TestSuite.run(TestSuite.java:225)
	at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
	at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Caused by: java.util.NoSuchElementException
	at java.util.AbstractList$Itr.next(AbstractList.java:427)
	at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.parse(PDFXrefStreamParser.java:115)
	at org.apache.pdfbox.cos.COSDocument.parseXrefStreams(COSDocument.java:538)
	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
	... 22 more

Still, I think the writeCharacters() method will not affect performance all that much.

> PDFTextStripper.writeCharacters is called no where in the class
> ---------------------------------------------------------------
>
>                 Key: PDFBOX-533
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-533
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Navendu Garg
>
> It seems writeCharacters method is not called anywhere in the PDFTextStripper
> class. This makes it impossible for handling character TextPosition as well
> as Line Separator because processLineSeparator method is no longer there and
> writeLineSeparator is called when actual writing happens.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
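
The problem reported in the issue, an overridable hook that the base class never invokes, can be illustrated with a small self-contained sketch. The classes below (HookDemo, Stripper, BypassingStripper, DelegatingStripper) are hypothetical stand-ins invented for illustration and are not PDFBox code. Only the method name writeCharacters is borrowed from the issue, and its signature here is an assumption simplified to a single char.

```java
public class HookDemo {

    /** Hypothetical base class with a per-character hook for subclasses. */
    static abstract class Stripper {
        protected final StringBuilder out = new StringBuilder();

        // Hook that subclasses may override to customise per-character output.
        protected void writeCharacters(char c) {
            out.append(c);
        }

        public abstract String process(String text);
    }

    /** Writes text directly, so the hook is never reached (the reported situation). */
    static class BypassingStripper extends Stripper {
        @Override
        public String process(String text) {
            out.append(text);
            return out.toString();
        }
    }

    /** Routes every character through the hook, so overrides take effect. */
    static class DelegatingStripper extends Stripper {
        @Override
        public String process(String text) {
            for (char c : text.toCharArray()) {
                writeCharacters(c);
            }
            return out.toString();
        }
    }

    /** Overrides the hook on a bypassing stripper; the override is silently ignored. */
    public static String upperWithBypassing(String text) {
        Stripper s = new BypassingStripper() {
            @Override
            protected void writeCharacters(char c) {
                out.append(Character.toUpperCase(c));
            }
        };
        return s.process(text);
    }

    /** Overrides the hook on a delegating stripper; the override is applied. */
    public static String upperWithDelegating(String text) {
        Stripper s = new DelegatingStripper() {
            @Override
            protected void writeCharacters(char c) {
                out.append(Character.toUpperCase(c));
            }
        };
        return s.process(text);
    }

    public static void main(String[] args) {
        System.out.println(upperWithBypassing("abc"));  // prints "abc"
        System.out.println(upperWithDelegating("abc")); // prints "ABC"
    }
}
```

In the bypassing variant, a subclass has no way to intercept individual characters, which matches the symptom described in the issue. Whether the actual fix should route output through writeCharacters or expose a different hook is a design decision for the PDFBox maintainers.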