RE: [jira] performance issue flood

Martinez, Mel - 1004 - MITLL Tue, 19 Jan 2010 09:22:10 -0800

Thanks, Jukka, for committing all those!

Yes, I'll use diff patches for future submissions.  I've done that with
other open source projects.  Not all of them accept patches, so this time I
just submitted full modified copies of the files.

Also - thanks for the _additional_ performance tweaks that you put in!

Philipp asked about the expected performance boost.  That depends on what
exactly you are doing.  My submissions were very closely targeted around
text extraction.  My tests had two phases:  1) load the document and
2) extract the text.

In my test cases for text extraction to *memory*, from a 'typical' file that
my project expects to see (basically a presentation slide show with text and
graphics), I get a speedup of about 17.5 / 8.4 == ~2.1.
However, within the context of other things, the real-world improvement will
be less than that because these changes don't speed up things like file
output.  It also depends on the ratio of text objects to other objects
inside the file and on how the text objects are stored inside the stream
objects ... there are a lot of variables.
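
To put a rough number on that caveat, here is a minimal sketch of the usual
Amdahl-style estimate: if only part of the total run time is sped up, the
overall gain is smaller than the gain in that part.  The 17.5 / 8.4 ratio is
taken from above; the 60% share of time spent in extraction is a made-up
value for illustration only, not something I measured.

```java
public class SpeedupEstimate {

    /**
     * Overall speedup when only a fraction of the run time gets faster.
     *
     * @param fraction     share of the original run time spent in the
     *                     improved phase (0.0 to 1.0)
     * @param phaseSpeedup speedup factor of that phase alone
     */
    static double overallSpeedup(double fraction, double phaseSpeedup) {
        if (fraction < 0.0 || fraction > 1.0 || phaseSpeedup <= 0.0) {
            throw new IllegalArgumentException("fraction must be in [0,1], speedup > 0");
        }
        // The untouched part keeps its cost; the improved part shrinks.
        return 1.0 / ((1.0 - fraction) + fraction / phaseSpeedup);
    }

    public static void main(String[] args) {
        double extractionSpeedup = 17.5 / 8.4; // ratio quoted in the message
        double extractionShare = 0.6;          // HYPOTHETICAL share, for illustration
        System.out.printf("extraction alone: %.2fx%n", extractionSpeedup);
        System.out.printf("whole job:        %.2fx%n",
                overallSpeedup(extractionShare, extractionSpeedup));
    }
}
```

The smaller the share of time spent in text extraction (for example, when
file output dominates), the closer the overall figure gets to 1.0.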

Nevertheless, most any operation that involves parsing a PDF will get SOME
boost from the readUntilEndStream() method fix and many should also get some
boost from some of the various other tweaks.

At this point, most of the load() side of my tests is spent in the stream
read() method - so there is not much improvement left to get from
there.  There is definitely some more room in the 'extract text' side.

Mel

-----Original Message-----
From: Jukka Zitting [mailto:[email protected]] 
Sent: Friday, January 15, 2010 7:17 PM
To: [email protected]
Subject: Re: [jira] performance issue flood

Hi,

On Fri, Jan 15, 2010 at 8:30 AM, Philipp Koch <[email protected]> wrote:
> thanks a lot for your performance optimization contributions.

+1 Good stuff!

> what factor of (overall) speedup is to be expected?

I ran some simple tests and it looks like PDF loading (PDDocument.load
on a File) is now about 20% faster and text extraction
(PDFTextStripper.writeText on an already loaded PDDocument and a dummy
writer) about 30% faster than before Mel's patches and my additional
improvements.
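
A minimal sketch of how such a two-phase measurement could be set up, using
only the JDK: a writer that discards its input (the "dummy writer") and a
helper that times one phase.  The class and method names here (PhaseTimer,
DiscardingWriter, timed) are illustrative and not part of PDFBox, and the
workload in main() is a stand-in rather than a real PDF.

```java
import java.io.Writer;
import java.util.concurrent.Callable;

public class PhaseTimer {

    /** Writer that throws its input away and only counts characters. */
    static final class DiscardingWriter extends Writer {
        long chars = 0;
        @Override public void write(char[] buf, int off, int len) { chars += len; }
        @Override public void flush() { }
        @Override public void close() { }
    }

    /** Runs one phase, prints its wall-clock time, and returns its result. */
    static <T> T timed(String label, Callable<T> phase) throws Exception {
        long start = System.nanoTime();
        T result = phase.call();
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(label + ": " + millis + " ms");
        return result;
    }

    public static void main(String[] args) throws Exception {
        DiscardingWriter sink = new DiscardingWriter();
        // Stand-in phases; with PDFBox these would be the PDDocument.load
        // and PDFTextStripper.writeText calls mentioned above.
        String text = timed("load", () -> "stand-in for a loaded document");
        timed("extract", () -> { sink.write(text); return null; });
        System.out.println(sink.chars + " chars discarded");
    }
}
```

Timing each phase separately and writing to a discarding sink keeps file
output out of the numbers, so only loading and extraction are measured.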

PDFBox is still quite a bit slower than I'd hope, but this is already
a pretty good improvement.

PS. Mel, if you come up with other improvements, it'll be easier for
us to review and apply the changes if you submit them as patches
instead of full copies of the modified files. To create a patch, use
"svn diff" in your checkout.

BR,

Jukka Zitting
