[
https://issues.apache.org/jira/browse/PDFBOX-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088003#comment-13088003
]
Jeremy Villalobos commented on PDFBOX-1104:
-------------------------------------------
@Martinez
The improvement speeds up parsing of the PDF document. That mainly benefits
text extraction, but it should be useful for the other tools as well.
By the time you call PDFTextStripper.setStartPage(int), you have already called
PDDocument.load( file ). The load() method parses the entire PDF document
and holds it in memory, or writes part of the data structure back to disk
when a scratch file is used.
The modified parser shown here parses only the xref table and the catalog
object, then traverses the page tree to reach the desired page. Once it reaches
the page object, it loads all the COSObjects needed for that page. This reduces
the time it takes to parse the file, and the saving adds to any improvements
made to the TextExtraction class. This parsing approach also complies with the
PDF specification.
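To make the traversal concrete, here is a minimal, self-contained sketch of
the idea. It is a toy model and not the code in QuickParser.java: the class
LazyPageLookupSketch, the Node type, and the in-memory map standing in for the
file are all hypothetical names made up for illustration. It assumes each page
tree node records how many leaf pages sit below it, so whole subtrees can be
skipped without reading their pages.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these names come from PDFBox or the attached patch.
final class LazyPageLookupSketch {

    /** Page tree node: an intermediate /Pages node or a leaf /Page. */
    static final class Node {
        final int count;    // leaf pages at or below this node (1 for a leaf)
        final int[] kids;   // object numbers of the children (empty for a leaf)
        final String text;  // stand-in for a leaf page's content, null otherwise

        Node(int count, int[] kids, String text) {
            this.count = count;
            this.kids = kids;
            this.text = text;
        }

        boolean isLeaf() { return kids.length == 0; }
    }

    /** Stands in for the bytes on disk, keyed by object number like an xref table. */
    private final Map<Integer, Node> file = new HashMap<>();

    /** Objects parsed so far; anything absent has never been read. */
    final Map<Integer, Node> parsed = new HashMap<>();

    void put(int objNum, Node node) { file.put(objNum, node); }

    /** Parses a single object on demand. */
    Node load(int objNum) {
        Node node = parsed.get(objNum);
        if (node == null) {
            node = file.get(objNum);
            if (node == null) throw new IllegalArgumentException("no object " + objNum);
            parsed.put(objNum, node);
        }
        return node;
    }

    /** Walks from the root node to the zero-based pageIndex, using counts to skip subtrees. */
    Node findPage(int rootObjNum, int pageIndex) {
        Node node = load(rootObjNum);
        if (pageIndex < 0 || pageIndex >= node.count) {
            throw new IndexOutOfBoundsException("page " + pageIndex);
        }
        int remaining = pageIndex;
        while (!node.isLeaf()) {
            Node next = null;
            for (int kid : node.kids) {
                Node child = load(kid);
                if (remaining < child.count) { next = child; break; }
                remaining -= child.count; // target is not under this child; skip it
            }
            if (next == null) throw new IllegalStateException("inconsistent page counts");
            node = next;
        }
        return node;
    }

    /** Seven objects: root 1, intermediate nodes 2 and 3, leaf pages 4 to 7. */
    static LazyPageLookupSketch sample() {
        LazyPageLookupSketch doc = new LazyPageLookupSketch();
        doc.put(1, new Node(4, new int[] {2, 3}, null));
        doc.put(2, new Node(2, new int[] {4, 5}, null));
        doc.put(3, new Node(2, new int[] {6, 7}, null));
        for (int i = 0; i < 4; i++) {
            doc.put(4 + i, new Node(1, new int[0], "page " + i));
        }
        return doc;
    }

    public static void main(String[] args) {
        LazyPageLookupSketch doc = sample();
        Node page = doc.findPage(1, 2);
        System.out.println(page.text + ", parsed objects: " + doc.parsed.keySet());
    }
}
```

In this sample, asking for the third page loads only objects 1, 2, 3 and 6.
Objects 4, 5 and 7 are never read, which is where the time saving would come
from on a large document.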
Note that I only modified the TextExtraction class to be compatible with the
faster parser. The PDFTextStripper will still look for data in the COSDocument
tree that the improved parser has not parsed, and that results in a
NullPointerException. The modified version, OnePagePDFTextStripper, works only
on the PDPage supplied by the programmer, so no NullPointerException occurs.
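The difference between the two strippers can be illustrated with a small toy
example. This is only a sketch of the failure mode, not the code in
OnePagePDFTextStripper.java: the class OnePageStripperSketch, the Page type,
and both methods are hypothetical, and a null list entry stands in for an
object the partial parser never read.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: none of these names come from PDFBox or the attached patch.
final class OnePageStripperSketch {

    static final class Page {
        final String text;
        Page(String text) { this.text = text; }
    }

    /** Mimics a stripper that assumes every page in the document is loaded. */
    static String stripAll(List<Page> pages) {
        StringBuilder sb = new StringBuilder();
        for (Page p : pages) {
            sb.append(p.text).append('\n'); // throws NullPointerException if p was never parsed
        }
        return sb.toString();
    }

    /** Mimics the one-page approach: touches only the page it is handed. */
    static String stripOne(Page page) {
        return page.text;
    }

    public static void main(String[] args) {
        Page loaded = new Page("text of page 3");
        // Only the third page has been parsed; the other slots are still null.
        List<Page> pages = Arrays.asList(null, null, loaded, null);

        System.out.println(stripOne(loaded));
        try {
            stripAll(pages);
        } catch (NullPointerException e) {
            System.out.println("whole-document walk hit an unparsed page");
        }
    }
}
```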
I am glad to see some interest. This improvement was added to benefit an
Android version of pdfbox; in that environment the improvement (with large
PDFs) is noticeable. On desktop versions you need to measure it to notice
any improvement, since desktops have resources to spare :-)
I could use some help on how to run the official battery of pdf tests to verify
this parser improvement can handle all the files the current parser can handle.
> Improves parsing speed of a pdf by an average of 45% when extracting text
> from one random page in the document.
> ---------------------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-1104
> URL: https://issues.apache.org/jira/browse/PDFBOX-1104
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing, Utilities
> Affects Versions: 1.6.0
> Reporter: Jeremy Villalobos
> Priority: Minor
> Fix For: 1.6.0
>
> Attachments: OnePagePDFTextStripper.java, PagesNotExpectedHere.java,
> ParseTester.java, QuickParser.java, fast_parser.diff
>
>
> The proposed parser parses only the minimum required from the PDF file
> according to the PDF specification. A random page can be parsed without having
> to parse the entire document first. Existing parsing code was reused to carry
> over existing bugfixes and compliance fixes to this parser.
> The parser has been tested with the text extraction tool, but it has not been
> tested with the viewer or other pdf tools. Some tools may need to be recoded
> to use the parser without null pointer exceptions, since the COSDocument
> will contain null pointers for COSObjects that have not been parsed. For
> example, the current text extractor assumes the entire document is loaded.
> This code submission also includes a modified text extractor named
> OnePagePDFTextStripper. The class has a function that extracts the
> text from a PDPage submitted by the programmer.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira