Improves parsing speed of a PDF by an average of 45% when extracting text from 
one random page in the document.
---------------------------------------------------------------------------------------------------------------

                 Key: PDFBOX-1104
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1104
             Project: PDFBox
          Issue Type: Improvement
          Components: Parsing, Utilities
    Affects Versions: 1.6.0
            Reporter: Jeremy Villalobos
            Priority: Minor


The proposed parser parses only the minimum required from the PDF file 
according to the PDF specification.  A random page can be parsed without 
having to parse the entire document first.  Existing parsing code was reused 
so that existing bug fixes and compliance fixes carry over to this parser.
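
As a rough illustration of parsing on demand, here is a minimal sketch, 
not the submitted patch.  It assumes a cross-reference table that maps object 
numbers to byte offsets, and it parses an object only on first access.  The 
names LazyObjectTable, xrefOffsets and parseAt are hypothetical and are not 
PDFBox classes or methods.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

// Hypothetical illustration only: none of these names exist in PDFBox.
final class LazyObjectTable {
    // Object number -> byte offset, as a cross-reference table would provide.
    private final Map<Integer, Long> xrefOffsets;
    // Objects that have already been parsed.
    private final Map<Integer, String> parsed = new HashMap<>();
    // Stand-in for "parse the object found at this byte offset".
    private final LongFunction<String> parseAt;
    private int parseCount = 0;

    LazyObjectTable(Map<Integer, Long> xrefOffsets, LongFunction<String> parseAt) {
        this.xrefOffsets = xrefOffsets;
        this.parseAt = parseAt;
    }

    // Returns the object, parsing it on first access.
    // Returns null when the cross-reference table has no entry for it.
    String get(int objectNumber) {
        String cached = parsed.get(objectNumber);
        if (cached != null) {
            return cached;
        }
        Long offset = xrefOffsets.get(objectNumber);
        if (offset == null) {
            return null;
        }
        String value = parseAt.apply(offset);
        parseCount++;
        parsed.put(objectNumber, value);
        return value;
    }

    // Number of objects actually parsed so far.
    int parseCount() {
        return parseCount;
    }
}
```

In this sketch, asking for one object touches only that object's offset, so 
the cost depends on what the requested page references rather than on the 
size of the whole file.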

The parser has been tested with the text extraction tool, but it has not been 
tested with the viewer or other PDF tools.  Some tools may need to be recoded 
to use the parser without null pointer exceptions, since the COSDocument will 
contain null pointers for COSObjects that have not been parsed.  For example, 
the current text extractor assumes the entire document is loaded.  This code 
submission therefore also includes a modified text extractor named 
OnePagePDFTextStripper.  The class has a method that extracts the text 
from a PDPage supplied by the programmer.
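
The kind of guard a tool might need can be sketched as follows.  This is an 
assumption about how a caller could cope with unparsed entries, not code from 
the submission: the class NullSafeWalk and its method joinParsed are 
hypothetical, and plain strings stand in for parsed objects.

```java
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

// Hypothetical illustration only: not part of PDFBox or of the patch.
final class NullSafeWalk {
    // Joins the entries that have been parsed, skipping null (unparsed) slots
    // instead of dereferencing them.
    static String joinParsed(List<String> slots) {
        return slots.stream()
                .filter(Objects::nonNull)
                .collect(Collectors.joining(" "));
    }
}
```

A tool that walks every object would need a check of this kind, or would 
need to ask the parser to load the missing object before using it.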

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
