[jira] Commented: (PDFBOX-556) Performance regression from 0.7.3 to 0.8.0

Lars Torunski (JIRA) Wed, 11 Nov 2009 10:57:12 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776571#action_12776571
 ]


Lars Torunski commented on PDFBOX-556:
--------------------------------------

In general, the readUntilEndStream method accounts for 5-10% of the time during 
the indexing process. Using the tracing option of YourKit's CPU profiling 
inflates this percentage, because the profiler adds overhead by counting every 
invocation of the small methods it calls, some of which return only one character.

The screenshot was taken during the first part of the parsing process. During 
this phase, 70% of the time was spent in the readUntilEndStream method.

In common use cases, about 70% of the time is spent in PDFTextStripper.writeText 
and 15% in PDDocument.load; the latter includes readUntilEndStream.

> Performance regression from 0.7.3 to 0.8.0
> ------------------------------------------
>
>                 Key: PDFBOX-556
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-556
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>            Reporter: Lars Torunski
>         Attachments: screenshot-1.jpg
>
>
> After upgrading from version 0.7.3 to 0.8.0, our PDF indexing for Lucene takes 
> a lot longer than expected.
> E.g. a single PDF needs 1150ms to be indexed compared to 750ms with version 
> 0.7.3 ==>  +50%
> My first thought was that more PDFs are indexed, or are even indexed correctly, 
> with 0.8.0. But that shouldn't have an impact of more than 50%.
> Profiling with YourKit shows that a lot of time is spent in the method 
> BaseParser.readUntilEndStream and its invocation of cmpCircularBuffer. Maybe 
> somebody can find out how to improve the performance here.
> The method readUntilEndStream also handles endobj tags in the stream, which 
> of course affects performance, but this is OK.
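
To illustrate why a per-byte marker scan can dominate a profile, here is a minimal sketch of the general technique: read one byte at a time, keep the most recent bytes in a circular buffer, and compare that buffer against an end marker after every read. This is not the PDFBox implementation. The class EndMarkerScanner and the methods bytesBeforeMarker and ringEndsWith are hypothetical names chosen for this example, and the sketch ignores the endobj handling mentioned in the issue.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not taken from PDFBox.
public class EndMarkerScanner {

    // Returns the number of bytes that precede the first occurrence of
    // marker in the stream, or -1 if the stream ends without the marker.
    public static long bytesBeforeMarker(InputStream in, byte[] marker) throws IOException {
        if (marker.length == 0) {
            return 0;
        }
        byte[] ring = new byte[marker.length]; // holds the last marker.length bytes
        long read = 0;
        int b;
        while ((b = in.read()) != -1) {        // one call per byte
            ring[(int) (read % ring.length)] = (byte) b;
            read++;
            // One buffer comparison per byte once the ring is full: this is
            // the part that adds up on large streams.
            if (read >= marker.length && ringEndsWith(ring, read, marker)) {
                return read - marker.length;
            }
        }
        return -1;
    }

    // Compares the ring contents, oldest byte first, against the marker.
    private static boolean ringEndsWith(byte[] ring, long read, byte[] marker) {
        int start = (int) (read % ring.length); // index of the oldest byte
        for (int i = 0; i < marker.length; i++) {
            if (ring[(start + i) % ring.length] != marker[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "BT /F1 12 Tf ET\nendstream\nendobj".getBytes(StandardCharsets.US_ASCII);
        byte[] marker = "endstream".getBytes(StandardCharsets.US_ASCII);
        System.out.println(bytesBeforeMarker(new ByteArrayInputStream(data), marker));
    }
}
```

In a scan of this shape, both the single-byte read and the buffer comparison run once per input byte. That is also why a tracing profiler, which counts every invocation, would be expected to overstate the method's share.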

