[jira] [Commented] (PDFBOX-1104) Improves parsing speed of a pdf by an average of 45% when extracting text from one random page in the document.

Jeremy Villalobos (JIRA) Fri, 19 Aug 2011 14:47:53 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088003#comment-13088003
 ]


Jeremy Villalobos commented on PDFBOX-1104:
-------------------------------------------

@Martinez

The improvement speeds up parsing of the PDF document, which ultimately 
benefits text extraction, but it should be useful for the other tools as well.

By the time you call PDFTextStripper.setStartPage(int), you have already called 
PDDocument.load( file ).  The load() method parses the entire PDF document 
and puts it into memory, optionally moving some of the data structures back 
into a file through the use of the scratch file.

The modification to the parser shown here parses only the xref table and the 
catalog object, and then traverses the page tree to reach the desired page.  
Once it reaches the page object, it loads all the COSObjects needed for that 
page.  This reduces the time it takes to parse the file, which should compound 
with any improvements made to the TextExtraction class.  This parsing approach 
also complies with the PDF specification.
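
To make the traversal concrete, here is a minimal, self-contained sketch of 
the idea.  It is a toy model and not the QuickParser code from the attachment: 
the class LazyPageLookup, its Node type, and the sample object numbers are all 
hypothetical, and a HashMap stands in for the file plus its xref table.  It 
only shows how the /Count value of each page-tree node lets the lookup skip 
whole subtrees.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of on-demand page lookup (hypothetical; not PDFBox code). */
public class LazyPageLookup {

    /** Simplified page-tree node: an intermediate /Pages node or a leaf /Page. */
    static final class Node {
        final boolean isPage;
        final int count;     // leaf pages below this node (1 for a leaf)
        final int[] kids;    // object numbers of the children (empty for a leaf)
        final String text;   // stand-in for page content (null for /Pages nodes)

        private Node(boolean isPage, int count, int[] kids, String text) {
            this.isPage = isPage;
            this.count = count;
            this.kids = kids;
            this.text = text;
        }

        static Node pages(int count, int... kids) {
            return new Node(false, count, kids, null);
        }

        static Node page(String text) {
            return new Node(true, 1, new int[0], text);
        }
    }

    // Stands in for the bytes on disk, indexed by object number like an xref table.
    private final Map<Integer, Node> objects;

    // Records which object numbers were actually resolved ("parsed").
    final List<Integer> parsed = new ArrayList<Integer>();

    LazyPageLookup(Map<Integer, Node> objects) {
        this.objects = objects;
    }

    /** Resolve a single object by number and remember that it was touched. */
    private Node resolve(int objNum) {
        parsed.add(objNum);
        Node node = objects.get(objNum);
        if (node == null) {
            throw new IllegalStateException("missing object " + objNum);
        }
        return node;
    }

    /** Find the page at a zero-based index, descending only into the subtree that holds it. */
    Node findPage(int rootObj, int index) {
        Node node = resolve(rootObj);
        if (index < 0 || index >= node.count) {
            throw new IndexOutOfBoundsException("page index " + index);
        }
        while (!node.isPage) {
            Node next = null;
            for (int kid : node.kids) {
                Node child = resolve(kid);   // reads the child itself, not its subtree
                if (index < child.count) {
                    next = child;
                    break;
                }
                index -= child.count;        // skip every page under this child
            }
            if (next == null) {
                throw new IllegalStateException("inconsistent page counts");
            }
            node = next;
        }
        return node;
    }

    /** Four pages in a two-level tree: 1 -> (2 -> 4, 5), (3 -> 6, 7). */
    static Map<Integer, Node> sampleObjects() {
        Map<Integer, Node> m = new HashMap<Integer, Node>();
        m.put(1, Node.pages(4, 2, 3));
        m.put(2, Node.pages(2, 4, 5));
        m.put(3, Node.pages(2, 6, 7));
        m.put(4, Node.page("A"));
        m.put(5, Node.page("B"));
        m.put(6, Node.page("C"));
        m.put(7, Node.page("D"));
        return m;
    }

    public static void main(String[] args) {
        LazyPageLookup lookup = new LazyPageLookup(sampleObjects());
        System.out.println(lookup.findPage(1, 2).text);
        System.out.println(lookup.parsed);
    }
}
```

In this sample, looking up the third page resolves objects 1, 2, 3 and 6 only; 
the page objects 4, 5 and 7 are never touched.  A real parser would also have 
to load the resources and content streams referenced by the page it reaches.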

Note that I only modified the TextExtraction class to be compatible with the 
faster parser.  The PDFTextStripper will still look for data in the COSDocument 
tree that has not been parsed by the improved parser, and it will throw a 
NullPointerException.  The modified version, OnePagePDFTextStripper, works 
only on the PDPage given by the programmer, so there is no 
NullPointerException.
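
The failure mode can be sketched with a toy example.  None of the names below 
are PDFBox classes: SparseDocumentDemo, Page, extractRange and extractOne are 
hypothetical stand-ins, and a null list entry models a page object that the 
faster parser never loaded.  The sketch only illustrates why code that walks 
every page fails while code that is handed a single page does not.

```java
import java.util.Arrays;
import java.util.List;

/** Toy illustration only; these are not PDFBox classes. */
public class SparseDocumentDemo {

    /** Stand-in for a parsed page. */
    static final class Page {
        final String text;

        Page(String text) {
            this.text = text;
        }
    }

    /** Mimics an extractor that assumes every page in the document is loaded. */
    static String extractRange(List<Page> pages, int start, int end) {
        StringBuilder sb = new StringBuilder();
        int charsSeen = 0;
        for (int i = 0; i < pages.size(); i++) {
            Page p = pages.get(i);
            charsSeen += p.text.length();   // touches every page; a null entry throws here
            if (i >= start && i <= end) {
                sb.append(p.text);
            }
        }
        return sb.toString();
    }

    /** Mimics a one-page extractor: it only touches the page it is handed. */
    static String extractOne(Page page) {
        return page.text;
    }

    public static void main(String[] args) {
        // Only the middle page has been parsed; the others are still null.
        List<Page> pages = Arrays.asList(null, new Page("two"), null);

        System.out.println(extractOne(pages.get(1)));
        try {
            extractRange(pages, 1, 1);
        } catch (NullPointerException e) {
            System.out.println("whole-document walk failed on an unparsed page");
        }
    }
}
```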

I am glad to see some interest.  This improvement was added to benefit an 
Android version of pdfbox.  In that environment the improvement (using large 
PDFs) is noticeable.  On desktop versions you need to measure it to notice 
any improvement, since desktops have resources to spare :-)

I could use some help on how to run the official battery of PDF tests to verify 
that this parser improvement can handle all the files the current parser can 
handle.

> Improves parsing speed of a pdf by an average of 45% when extracting text 
> from one random page in the document.
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1104
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1104
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing, Utilities
>    Affects Versions: 1.6.0
>            Reporter: Jeremy Villalobos
>            Priority: Minor
>             Fix For: 1.6.0
>
>         Attachments: OnePagePDFTextStripper.java, PagesNotExpectedHere.java, 
> ParseTester.java, QuickParser.java, fast_parser.diff
>
>
> The proposed parser parses only the minimum required from the PDF file 
> according to the PDF specification.  A random page can be parsed without 
> having to parse the entire document first.  Existing parsing code was reused 
> to carry existing bugfixes and compliance fixes over to this parser.
> The parser has been tested with the text extraction tool, but it has not been 
> tested with the viewer or other PDF tools.  Some tools may need to be recoded 
> to use the parser to prevent null pointer exceptions, since the COSDocument 
> will contain null pointers for COSObjects that have not been parsed.  For 
> example, the current text extractor assumes the entire document is loaded.  
> This code submission also includes a modified text extractor named 
> OnePagePDFTextStripper.  The class has a function that extracts the text 
> from a PDPage supplied by the programmer.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
