[ https://issues.apache.org/jira/browse/PDFBOX-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756825#action_12756825 ]

Navendu Garg commented on PDFBOX-533:
-------------------------------------

I just extracted text from a 50 MB file using the old version, and it took
approximately 20 seconds to convert to text. Please note that I was doing other
things on the machine at the same time, so I admit this is a crude benchmark.
Unfortunately, the PDFBox 0.8.0-incubating version crashed on this file with
the following error:

org.apache.pdfbox.exceptions.WrappedIOException
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:237)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:841)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:808)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:779)
        at org.apache.pdfbox.util.TestLargePDF.test(TestLargePDF.java:13)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:592)
        at junit.framework.TestCase.runTest(TestCase.java:164)
        at junit.framework.TestCase.runBare(TestCase.java:130)
        at junit.framework.TestResult$1.protect(TestResult.java:106)
        at junit.framework.TestResult.runProtected(TestResult.java:124)
        at junit.framework.TestResult.run(TestResult.java:109)
        at junit.framework.TestCase.run(TestCase.java:120)
        at junit.framework.TestSuite.runTest(TestSuite.java:230)
        at junit.framework.TestSuite.run(TestSuite.java:225)
        at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
        at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Caused by: java.util.NoSuchElementException
        at java.util.AbstractList$Itr.next(AbstractList.java:427)
        at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.parse(PDFXrefStreamParser.java:115)
        at org.apache.pdfbox.cos.COSDocument.parseXrefStreams(COSDocument.java:538)
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
        ... 22 more
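
The "Caused by" part of the trace is the usual symptom of calling
Iterator.next() on an iterator that has no elements left. The code at
PDFXrefStreamParser.java:115 is not shown here, so the following is only a
minimal standalone sketch of that failure mode and of one way to guard against
it. The class and method names (IteratorGuardSketch, readPairsUnchecked,
readPairsChecked) are made up for illustration and are not part of PDFBox.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical example, not PDFBox code: shows how reading fixed-size
// groups from an iterator fails when the input is shorter than expected.
public class IteratorGuardSketch {

    // Consumes tokens two at a time and only checks hasNext() once per pair,
    // so an odd-length list makes the second next() throw.
    public static List<String> readPairsUnchecked(List<String> tokens) {
        List<String> pairs = new ArrayList<String>();
        Iterator<String> it = tokens.iterator();
        while (it.hasNext()) {
            String first = it.next();
            String second = it.next(); // NoSuchElementException if exhausted
            pairs.add(first + "/" + second);
        }
        return pairs;
    }

    // Same loop, but a short read is reported as an IOException that says
    // what was missing instead of surfacing a bare NoSuchElementException.
    public static List<String> readPairsChecked(List<String> tokens)
            throws IOException {
        List<String> pairs = new ArrayList<String>();
        Iterator<String> it = tokens.iterator();
        while (it.hasNext()) {
            String first = it.next();
            if (!it.hasNext()) {
                throw new IOException(
                    "Truncated entry: no second token after '" + first + "'");
            }
            pairs.add(first + "/" + it.next());
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> truncated = Arrays.asList("a", "b", "c");
        try {
            readPairsUnchecked(truncated);
        } catch (NoSuchElementException e) {
            System.out.println("unchecked reader failed: " + e);
        }
        try {
            readPairsChecked(truncated);
        } catch (IOException e) {
            System.out.println("checked reader failed: " + e.getMessage());
        }
    }
}
```

Whether the real parser should reject such input or skip the incomplete entry
is a separate design question; the sketch only shows where the exception comes
from.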

Still, I think the writeCharacters() method will not affect performance all
that much.

> PDFTextStripper.writeCharacters is called no where in the class
> ---------------------------------------------------------------
>
>                 Key: PDFBOX-533
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-533
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Navendu Garg
>
> It seems the writeCharacters method is not called anywhere in the
> PDFTextStripper class. This makes it impossible to handle character
> TextPosition objects as well as line separators, because the
> processLineSeparator method is no longer there and writeLineSeparator is
> called when the actual writing happens.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
