[ 
https://issues.apache.org/jira/browse/PDFBOX-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler resolved PDFBOX-893.
---------------------------------------

    Resolution: Fixed
      Assignee: Andreas Lehmkühler

I added the patch in revision 1044823 as proposed by Neil McErlean. I made some 
minor tweaks to the PDFStreamEngine part, because some of that code had changed 
since the patch was created.

Thanks for the contribution!!

> Performance improvement in PDFStreamEngine and Matrix (patch included)
> ----------------------------------------------------------------------
>
>                 Key: PDFBOX-893
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-893
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Utilities
>    Affects Versions: 1.3.1
>         Environment: All
>            Reporter: Neil McErlean
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>             Fix For: 1.4.0
>
>         Attachments: PDFBOX_perf_patch.txt
>
>
> I've been profiling PDFBox during text extraction from some large PDF 
> documents, e.g. 2000 pages, mostly text, 20 MB file size.
> Some of these documents can take a long time to process, e.g. 40s+, sometimes 
> a lot more than that.
>     (I'm using a 2.5 GHz, 4 GB, Mac OS X 10.5.8, Java(TM) SE Runtime 
> Environment (build 1.6.0_22-b04-307-9M3263) with -Xms256m -Xmx1024m 
> -XX:PermSize=256m)
> I've begun by profiling where the code spends its time during text extraction 
> and I see that a lot of time is spent constructing 
> org.apache.pdfbox.util.Matrix objects.
> Screenshot PDFReference_nopatch.tiff shows the most used methods in PDFBox 
> during text extraction for a large document. When this screenshot was taken 
> the percentages had stabilised, and Matrix.<init> apparently accounted for 
> 40% of CPU time, the largest share of any method. I was surprised.
> Most of these Matrix instances are being constructed within 
> PDFStreamEngine.processEncodedText(byte[]).
> On revision 1035639 (pre-1.4.0) this method constructs one Matrix object and 
> then a further 7 within a loop which is called for each character in the 
> document. So that's a lot of Matrix objects.
> The attached patch refactors PDFStreamEngine.processEncodedText so that it 
> now creates 5 reusable Matrix instances outside the loop and 2 within it.
> This was achieved by adding a new method to Matrix: Matrix.multiply(Matrix, 
> Matrix) which allows you to multiply two matrices and have the result stored 
> in a specified Matrix object. This has the effect of reducing the number of 
> temporary Matrix objects created during multiplication within 
> PDFStreamEngine. This should save the garbage collector some work.
> I profiled PDFBox again with this patch included and Matrix.<init> now 
> accounts for only 30% of the CPU time.
> Unfortunately, whilst fewer temporary objects are being created, it doesn't 
> have an appreciable effect on the time it takes to extract text from my large 
> documents.
> The profiling continues...
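
For readers who want to see the idea in the description in code form, here is
a minimal sketch of a multiply method that writes into a caller-supplied
result object, plus a loop that reuses one scratch instance. SimpleMatrix,
its field v, the of() factory and MatrixReuseDemo are hypothetical names made
up for this illustration; this is not the code from PDFBOX_perf_patch.txt and
not PDFBox's actual Matrix class.

```java
// Hypothetical stand-in for a 3x3 transform matrix (NOT PDFBox's Matrix class).
final class SimpleMatrix {
    // Row-major 3x3 values, initialised to the identity matrix.
    final float[] v = {1, 0, 0, 0, 1, 0, 0, 0, 1};

    // Builds [[a, b, 0], [c, d, 0], [e, f, 1]].
    static SimpleMatrix of(float a, float b, float c, float d, float e, float f) {
        SimpleMatrix m = new SimpleMatrix();
        m.v[0] = a; m.v[1] = b;
        m.v[3] = c; m.v[4] = d;
        m.v[6] = e; m.v[7] = f;
        return m;
    }

    // Allocating form: every call creates a new result object.
    SimpleMatrix multiply(SimpleMatrix other) {
        return multiply(other, new SimpleMatrix());
    }

    // Destination form: stores this * other in 'result' and returns it.
    // All nine products are computed into locals first, so 'result' may
    // be the same object as 'this' or 'other'.
    SimpleMatrix multiply(SimpleMatrix other, SimpleMatrix result) {
        float[] a = this.v;
        float[] b = other.v;
        float r0 = a[0] * b[0] + a[1] * b[3] + a[2] * b[6];
        float r1 = a[0] * b[1] + a[1] * b[4] + a[2] * b[7];
        float r2 = a[0] * b[2] + a[1] * b[5] + a[2] * b[8];
        float r3 = a[3] * b[0] + a[4] * b[3] + a[5] * b[6];
        float r4 = a[3] * b[1] + a[4] * b[4] + a[5] * b[7];
        float r5 = a[3] * b[2] + a[4] * b[5] + a[5] * b[8];
        float r6 = a[6] * b[0] + a[7] * b[3] + a[8] * b[6];
        float r7 = a[6] * b[1] + a[7] * b[4] + a[8] * b[7];
        float r8 = a[6] * b[2] + a[7] * b[5] + a[8] * b[8];
        float[] r = result.v;
        r[0] = r0; r[1] = r1; r[2] = r2;
        r[3] = r3; r[4] = r4; r[5] = r5;
        r[6] = r6; r[7] = r7; r[8] = r8;
        return result;
    }
}

final class MatrixReuseDemo {
    // Per-glyph loop that allocates its matrices once, outside the loop.
    static float sumScaledOffsets(int glyphs) {
        SimpleMatrix scale = SimpleMatrix.of(2, 0, 0, 2, 0, 0); // uniform 2x scale
        SimpleMatrix glyph = new SimpleMatrix();   // reused: per-glyph translation
        SimpleMatrix scratch = new SimpleMatrix(); // reused: holds each product
        float total = 0;
        for (int i = 0; i < glyphs; i++) {
            glyph.v[6] = i;                 // translate x by i
            glyph.multiply(scale, scratch); // no allocation inside the loop
            total += scratch.v[6];          // scaled x offset of this glyph
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumScaledOffsets(4));
    }
}
```

The loop body above performs no Matrix allocations at all, whereas the
allocating form would create one object per glyph. As the description notes,
fewer allocations do not necessarily mean a measurable speed-up, so a change
like this should be checked with a profiler.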
