[ https://issues.apache.org/jira/browse/PDFBOX-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776571#action_12776571 ]
Lars Torunski commented on PDFBOX-556:
--------------------------------------

In general, the readUntilEndStream method accounts for 5-10% of the time during the indexing process. Using the tracing option of YourKit's CPU profiling increases that percentage, because the profiler adds overhead by counting invocations of the many small methods involved, e.g. those that return only a single character.

The screenshot was taken during the first part of the parsing process. During this time 70% was spent in the readUntilEndStream method.

In common use cases about 70% is spent in PDFTextStripper.writeText and 15% in PDDocument.load, the latter of which includes readUntilEndStream.

> Performance regression from 0.7.3 to 0.8.0
> ------------------------------------------
>
>                 Key: PDFBOX-556
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-556
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>            Reporter: Lars Torunski
>         Attachments: screenshot-1.jpg
>
>
> After upgrading from version 0.7.3 to 0.8.0, our PDF indexing for Lucene takes
> a lot longer than expected.
> E.g. a single PDF needs 1150ms to be indexed compared to 750ms with version
> 0.7.3 ==> +50%
> My first thought was that more PDFs are indexed, or are even indexed correctly,
> with 0.8.0. But that shouldn't have an impact of more than 50%.
> Profiling with YourKit shows that a lot of time is spent in the method
> BaseParser.readUntilEndStream and its invocation of cmpCircularBuffer. Maybe
> somebody can find out how to improve the performance here.
> The method readUntilEndStream also handles endobj tags in the stream, which
> of course impacts performance, but this is OK.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.