Stream parsing performance improvement + patch
----------------------------------------------

                 Key: PDFBOX-1175
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1175
             Project: PDFBox
          Issue Type: Improvement
          Components: Parsing
    Affects Versions: 1.7.0
            Reporter: Timo Boehme
            Priority: Minor


Stream parsing is one of the critical parts from a performance point of 
view, since typically most data is stored in streams. While PDFBox already got 
some speedup some time ago in the method that copies stream data from the file 
to a random access buffer (BaseParser#readUntilEndStream), there is still room 
for improvement.

The problem with the current implementation is the byte-wise reading and 
writing of the data. I have rewritten the method to use byte arrays for I/O and 
reduced the number of comparisons needed to find 'endstream'/'endobj'. 
This results in 7-8 times faster parsing of streams and 3-4 times faster 
parsing of a normal 10-page PDF.
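
To illustrate the general idea, here is a minimal sketch under assumptions 
of my own, not the actual replacement method. The class name 
StreamCopySketch, the method copyUntilEndMarker and the 8192-byte buffer 
size are hypothetical. It reads and writes in blocks and only runs a full 
keyword comparison at positions where the byte 'e' occurs:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the PDFBox implementation.
class StreamCopySketch {
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.ISO_8859_1);
    private static final byte[] ENDOBJ = "endobj".getBytes(StandardCharsets.ISO_8859_1);
    // Bytes held back per block so a keyword split across two reads is still found.
    private static final int KEEP = ENDSTREAM.length - 1;

    /** Copies bytes from in to out up to (excluding) the first keyword; returns the count. */
    static long copyUntilEndMarker(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192]; // assumed block size
        int len = 0;                 // number of valid bytes currently in buf
        long written = 0;
        int n;
        while ((n = in.read(buf, len, buf.length - len)) > 0) {
            len += n;
            int hit = indexOfMarker(buf, len);
            if (hit >= 0) {
                out.write(buf, 0, hit);
                return written + hit;
            }
            // No keyword yet: flush everything except the last KEEP bytes.
            int flush = Math.max(0, len - KEEP);
            out.write(buf, 0, flush);
            written += flush;
            System.arraycopy(buf, flush, buf, 0, len - flush);
            len -= flush;
        }
        out.write(buf, 0, len); // end of input without a keyword
        return written + len;
    }

    // Both keywords start with 'e', so most bytes cost a single comparison.
    private static int indexOfMarker(byte[] buf, int len) {
        for (int i = 0; i < len; i++) {
            if (buf[i] == 'e'
                    && (matchesAt(buf, len, i, ENDSTREAM) || matchesAt(buf, len, i, ENDOBJ))) {
                return i;
            }
        }
        return -1;
    }

    private static boolean matchesAt(byte[] buf, int len, int pos, byte[] marker) {
        if (pos + marker.length > len) {
            return false;
        }
        for (int k = 0; k < marker.length; k++) {
            if (buf[pos + k] != marker[k]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "some stream data\nendstream\nendobj\n".getBytes(StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copyUntilEndMarker(new ByteArrayInputStream(data), out);
        System.out.println("copied " + copied + " bytes");
    }
}
```

Note that this sketch may consume input beyond the keyword. A real parser 
would have to push the unread bytes back to its source and would also need 
to deal with the end-of-line bytes preceding 'endstream'.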

See the attached file, which is a drop-in replacement for the 
readUntilEndStream method in BaseParser.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
