Thanks, Jukka, for committing all those! Yes, I'll use diff patches for future submissions. I've done that with other open source projects. Not all of them accept patches that way, so I went with submitting full modified copies of the files.
Also, thanks for the _additional_ performance tweaks that you put in!

Philipp asked about the expected performance boost. That depends on what exactly you are doing. My submissions were very closely targeted at text extraction. I had two phases: 1) load the document and 2) extract the text. In my test cases for text extraction to *memory*, from a 'typical' file that my project expects to see (basically a presentation slide show with text and graphics), I get a performance improvement of about 17.5 / 8.4 == ~2.1.

However, in the context of other things, the real-world improvement will be less than that, because these improvements don't speed up things like the file output. It also depends on the ratio of text objects to other objects inside the file and on how the text objects are stored inside the stream objects ... there are a lot of variables. Nevertheless, almost any operation that involves parsing a PDF will get SOME boost from the readUntilEndStream() method fix, and many should also get some boost from the various other tweaks.

At this point, most of the load() side of my tests is spent in the stream read() method, so there is not much improvement left to get from there. There is definitely some more room on the 'extract text' side.

Mel

-----Original Message-----
From: Jukka Zitting [mailto:[email protected]]
Sent: Friday, January 15, 2010 7:17 PM
To: [email protected]
Subject: Re: [jira] performance issue flood

Hi,

On Fri, Jan 15, 2010 at 8:30 AM, Philipp Koch <[email protected]> wrote:
> thanks a lot for your performance optimization contributions.

+1 Good stuff!

> what factor of (overall) speedup is to be expected?

I ran some simple tests and it looks like PDF loading (PDDocument.load on a File) is now about 20% faster and text extraction (PDFTextStripper.writeText on an already loaded PDDocument and a dummy writer) about 30% faster than before Mel's patches and my additional improvements.
PDFBox is still quite a bit slower than I'd hope, but this is already a pretty good improvement.

PS. Mel, if you come up with other improvements, it'll be easier for us to review and apply the changes if you submit them as patches instead of full copies of the modified files. To create a patch, use "svn diff" in your checkout.

BR,

Jukka Zitting
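
Mel's point that the real-world gain will be smaller than the ~2.1x measured for text extraction alone can be illustrated with a rough Amdahl's-law estimate. This is only a sketch: the class and method names are made up for illustration, and the fractions of total runtime spent in the accelerated code are hypothetical values, not measurements from this thread. Only the 17.5 / 8.4 ratio comes from Mel's test.

```java
class SpeedupEstimate {
    /**
     * Amdahl's law: overall speedup when only part of the runtime gets faster.
     * acceleratedFraction is the share of the original runtime (0..1) spent in
     * the code that was sped up; phaseSpeedup is how much faster that code is.
     */
    public static double overallSpeedup(double acceleratedFraction, double phaseSpeedup) {
        return 1.0 / ((1.0 - acceleratedFraction) + acceleratedFraction / phaseSpeedup);
    }

    public static void main(String[] args) {
        // Ratio reported for text extraction to memory (about 2.1).
        double phaseSpeedup = 17.5 / 8.4;

        // Hypothetical shares of total runtime spent in the accelerated code.
        double[] fractions = {1.0, 0.75, 0.5, 0.25};
        for (double fraction : fractions) {
            System.out.printf("fraction=%.2f -> overall speedup %.2fx%n",
                    fraction, overallSpeedup(fraction, phaseSpeedup));
        }
    }
}
```

The smaller the share of a job that goes through the patched parsing and extraction code (for example, when file output dominates), the closer the overall factor gets to 1.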
