[
https://issues.apache.org/jira/browse/PDFBOX-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Neil McErlean updated PDFBOX-893:
---------------------------------
Attachment: PDFBOX_perf_patch.txt
Patch file
> Performance improvement in PDFStreamEngine and Matrix (patch included)
> ----------------------------------------------------------------------
>
> Key: PDFBOX-893
> URL: https://issues.apache.org/jira/browse/PDFBOX-893
> Project: PDFBox
> Issue Type: Improvement
> Components: Utilities
> Affects Versions: 1.3.1
> Environment: All
> Reporter: Neil McErlean
> Fix For: 1.4.0
>
> Attachments: PDFBOX_perf_patch.txt
>
>
> I've been profiling PDFBox during text extraction from some large PDF
> documents, e.g. 2000 pages, mostly text, 20 MB file size.
> Some of these documents can take a long time to process, e.g. 40s or more,
> sometimes a lot more than that.
> (I'm using a 2.5 GHz, 4 GB, Mac OS X 10.5.8 machine with Java(TM) SE Runtime
> Environment (build 1.6.0_22-b04-307-9M3263) and -Xms256m -Xmx1024m
> -XX:PermSize=256m.)
> I've begun by profiling where the code spends its time during text extraction
> and I see that a lot of time is spent constructing
> org.apache.pdfbox.util.Matrix objects.
> Screenshot PDFReference_nopatch.tiff shows the most used methods in PDFBox
> during text extraction for a large document. When this screenshot was taken
> the percentages had stabilised and Matrix.<init> apparently accounted for 40%
> of CPU time, the largest share of any method. I was surprised.
> Most of these Matrix instances are being constructed within
> PDFStreamEngine.processEncodedText(byte[]).
> On revision 1035639 (pre-1.4.0) this method constructs one Matrix object and
> then a further 7 inside a loop that runs once for each character in the
> document. So that's a lot of Matrix objects.
> The attached patch refactors PDFStreamEngine.processEncodedText so that it
> now creates 5 reusable Matrix instances outside the loop and 2 within it.
> This was achieved by adding a new method to Matrix: Matrix.multiply(Matrix,
> Matrix) which allows you to multiply two matrices and have the result stored
> in a specified Matrix object. This has the effect of reducing the number of
> temporary Matrix objects created during multiplication within
> PDFStreamEngine. This should save the garbage collector some work.
> I profiled PDFBox again with this patch included and Matrix.<init> now
> accounts for only 30% of the CPU time.
> Unfortunately, whilst fewer temporary objects are being created, it doesn't
> have an appreciable effect on the time it takes to extract text from my large
> documents.
> The profiling continues...
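
The idea described above, multiplying two matrices and storing the product in a caller-supplied object so that a per-character loop does not allocate a temporary each time, can be sketched roughly as follows. This is only an illustration under stated assumptions: SketchMatrix, its row-major float[9] layout, and the scaling/translation helpers are hypothetical names made up for this example. It is not the PDFBox Matrix class and not the code in the attached patch.

```java
import java.util.Arrays;

/** Hypothetical 3x3 row-major matrix for illustration; not PDFBox's Matrix class. */
final class SketchMatrix {
    // Starts out as the identity matrix.
    final float[] m = {1, 0, 0, 0, 1, 0, 0, 0, 1};

    static SketchMatrix scaling(float sx, float sy) {
        SketchMatrix s = new SketchMatrix();
        s.m[0] = sx;
        s.m[4] = sy;
        return s;
    }

    static SketchMatrix translation(float tx, float ty) {
        SketchMatrix t = new SketchMatrix();
        t.m[6] = tx;
        t.m[7] = ty;
        return t;
    }

    /** Allocating variant: every call creates a new result object. */
    SketchMatrix multiply(SketchMatrix other) {
        return multiply(other, new SketchMatrix());
    }

    /**
     * Reusing variant: stores this * other in result and returns result.
     * The product is computed into locals first, so result may be the
     * same object as this or other.
     */
    SketchMatrix multiply(SketchMatrix other, SketchMatrix result) {
        float[] a = m, b = other.m, r = result.m;
        float r0 = a[0] * b[0] + a[1] * b[3] + a[2] * b[6];
        float r1 = a[0] * b[1] + a[1] * b[4] + a[2] * b[7];
        float r2 = a[0] * b[2] + a[1] * b[5] + a[2] * b[8];
        float r3 = a[3] * b[0] + a[4] * b[3] + a[5] * b[6];
        float r4 = a[3] * b[1] + a[4] * b[4] + a[5] * b[7];
        float r5 = a[3] * b[2] + a[4] * b[5] + a[5] * b[8];
        float r6 = a[6] * b[0] + a[7] * b[3] + a[8] * b[6];
        float r7 = a[6] * b[1] + a[7] * b[4] + a[8] * b[7];
        float r8 = a[6] * b[2] + a[7] * b[5] + a[8] * b[8];
        r[0] = r0; r[1] = r1; r[2] = r2;
        r[3] = r3; r[4] = r4; r[5] = r5;
        r[6] = r6; r[7] = r7; r[8] = r8;
        return result;
    }

    public static void main(String[] args) {
        SketchMatrix scale = scaling(2f, 3f);
        // Both working matrices are allocated once, outside the loop.
        SketchMatrix shift = translation(0f, 7f);
        SketchMatrix result = new SketchMatrix();
        for (int glyph = 0; glyph < 3; glyph++) {
            shift.m[6] = 5f * glyph;       // update in place instead of constructing
            scale.multiply(shift, result); // product lands in the reused object
            System.out.println(Arrays.toString(result.m));
        }
    }
}
```

The one-argument multiply is kept so existing callers are unaffected, and only hot loops need to opt in to the reusing overload. As the report itself notes, fewer allocations do not necessarily translate into a shorter run time, so a change like this still has to be measured.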