[ https://issues.apache.org/jira/browse/PDFBOX-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Timo Boehme updated PDFBOX-1175:
--------------------------------
Attachment: BaseParser_readUntilEndStream.java
The optimized method (BaseParser#readUntilEndStream) for copying stream data
from file to the random access buffer.
> Stream parsing performance improvement + patch
> ----------------------------------------------
>
> Key: PDFBOX-1175
> URL: https://issues.apache.org/jira/browse/PDFBOX-1175
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing
> Affects Versions: 1.7.0
> Reporter: Timo Boehme
> Priority: Minor
> Attachments: BaseParser_readUntilEndStream.java
>
>
> Stream parsing is one of the critical parts from a performance point of view,
> since most data in a typical PDF is stored in streams. Although PDFBox already
> got some speedup a while ago in the method that copies stream data from the
> file to the random access buffer (BaseParser#readUntilEndStream), there is
> still room for improvement.
> The problem with the current implementation is that it reads and writes the
> data byte by byte. I have rewritten the method to use byte arrays for I/O and
> reduced the number of comparisons needed to find 'endstream'/'endobj'.
> This results in 7-8 times faster parsing of streams and 3-4 times faster
> parsing of a normal 10-page PDF.
> See the attached file, which is a drop-in replacement for the
> readUntilEndStream method in BaseParser.
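For readers without the attachment at hand, the general idea of replacing byte-wise copying with array-based I/O can be sketched as follows. This is not the attached patch and not PDFBox code: the class name BufferedStreamCopy, the method copyUntilMarker, and the bufSize parameter are hypothetical names chosen for illustration, and the sketch searches only for the literal 'endstream' keyword with a naive comparison loop. A real parser would also have to handle 'endobj', the end-of-line bytes before the keyword, marker bytes that occur inside binary data, and pushing back the bytes read past the marker.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the method attached to PDFBOX-1175.
class BufferedStreamCopy {
    private static final byte[] ENDSTREAM =
            "endstream".getBytes(StandardCharsets.US_ASCII);

    // Copies bytes from in to out in chunks until the 'endstream' keyword is
    // found. Returns true if the keyword was seen, false on end of input.
    // Note: bytes after the keyword may already have been consumed from in.
    static boolean copyUntilMarker(InputStream in, OutputStream out, int bufSize)
            throws IOException {
        // A keyword split across two reads can leave at most this many of its
        // bytes at the end of a chunk, so that many bytes are held back.
        int keep = ENDSTREAM.length - 1;
        byte[] buf = new byte[Math.max(bufSize, ENDSTREAM.length) + keep];
        int held = 0; // bytes carried over from the previous chunk
        int n;
        while ((n = in.read(buf, held, buf.length - held)) != -1) {
            int len = held + n;
            int pos = indexOf(buf, len, ENDSTREAM);
            if (pos >= 0) {
                out.write(buf, 0, pos); // everything before the keyword
                return true;
            }
            // Write all bytes that can no longer be part of a keyword.
            int flush = Math.max(0, len - keep);
            out.write(buf, 0, flush);
            held = len - flush;
            System.arraycopy(buf, flush, buf, 0, held);
        }
        out.write(buf, 0, held); // input ended without the keyword
        return false;
    }

    // Naive search for marker within the first len bytes of data.
    static int indexOf(byte[] data, int len, byte[] marker) {
        outer:
        for (int i = 0; i <= len - marker.length; i++) {
            for (int j = 0; j < marker.length; j++) {
                if (data[i + j] != marker[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        byte[] pdfPart = "abcdefghijendstream\nendobj"
                .getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        boolean found = copyUntilMarker(new ByteArrayInputStream(pdfPart), out, 4);
        System.out.println(found + " "
                + new String(out.toByteArray(), StandardCharsets.US_ASCII));
    }
}
```

The point of the sketch is only that each read() and write() call moves a whole array slice instead of a single byte, which is where the bulk of the saving described above would be expected to come from.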