This tells us that the focus needs to be on PRTokeniser and RAFOA. For what it's worth, these have also shown up as the bottlenecks in profiling I've done in the parser package (not too surprising).
I'll discuss each in turn:

RandomAccessFileOrArray - I've done a lot of thinking on this one in the past. It's an interesting class that can have a massive impact on performance, depending on how you use it. For example, if you load your source PDF entirely into memory and pass the bytes into RAFOA, it removes the IO bottleneck entirely. Giovanni - if your source PDFs are small enough, you might want to try this, just to get a feel for the impact that IO blocking is having on your results (read the entire PDF into a byte[] and use PdfReader(byte[])).

The next thing I looked at was buffering (a naive use of RandomAccessFile is horrible for performance, and significant gains can be had by implementing a paging strategy). I actually implemented a paging RandomAccessFile and started work on rolling it into RAFOA last year, but my benchmarks showed that the memory mapped strategy RAFOA already uses had equivalent performance to the paging RAF implementation. Those tests weren't conclusive, so there may still be some things to learn in this area.

The one problem with the memory mapped strategy (in its current implementation) is that really big source PDFs still can't be mapped into memory. This could be addressed by using a paging strategy on the mapped portion of the file - probably keeping 10 or 15 mapped regions (maybe 1MB in size each) in an MRU cache. For reference, the ugly (really ugly) hack that determines whether RAFOA will use memory mapped IO is the Document.plainRandomAccess static public variable (shudder).

So what about the code paths in PRTokeniser.nextToken()? We've got a number of tight loops reading individual characters from the RAFOA. If the backing source isn't buffered, this would be a problem, but I don't know that this is really the issue here (it would be worth measuring, though...).
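To make the MRU-cache idea concrete, here is a rough sketch of what paging the mapped portion of the file might look like. This is not code from iText - the class name is made up, and the 1MB region size and 15-region cache are just the guesses from the paragraph above:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: map the file in fixed-size regions on demand
// instead of all at once, so arbitrarily large files can be read.
class PagedMappedFile implements AutoCloseable {
    private static final int REGION_SHIFT = 20;              // 2^20 = 1MB regions
    private static final long REGION_SIZE = 1L << REGION_SHIFT;
    private static final int MAX_REGIONS = 15;

    private final RandomAccessFile raf;
    private final FileChannel channel;
    private final long length;

    // Access-ordered LinkedHashMap: the least-recently-used region is
    // evicted once the cache exceeds MAX_REGIONS (the dropped buffer is
    // unmapped lazily, when the GC collects it).
    private final Map<Long, MappedByteBuffer> regions =
        new LinkedHashMap<Long, MappedByteBuffer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, MappedByteBuffer> eldest) {
                return size() > MAX_REGIONS;
            }
        };

    PagedMappedFile(String path) throws IOException {
        raf = new RandomAccessFile(path, "r");
        channel = raf.getChannel();
        length = channel.size();
    }

    // Single-byte read, mirroring RAFOA's convention of returning -1 at EOF.
    int read(long pos) throws IOException {
        if (pos < 0 || pos >= length) return -1;
        long regionIndex = pos >>> REGION_SHIFT;
        MappedByteBuffer region = regions.get(regionIndex);
        if (region == null) {
            long start = regionIndex << REGION_SHIFT;
            long size = Math.min(REGION_SIZE, length - start);
            region = channel.map(FileChannel.MapMode.READ_ONLY, start, size);
            regions.put(regionIndex, region);
        }
        return region.get((int) (pos & (REGION_SIZE - 1))) & 0xff;
    }

    public void close() throws IOException {
        channel.close();
        raf.close();
    }
}
```

The access-ordered LinkedHashMap does the MRU bookkeeping for free; the main caveat is that Java gives no portable way to unmap a MappedByteBuffer eagerly, so evicted regions linger until GC.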
The StringBuffer could definitely be replaced with a StringBuilder, and it could be re-used instead of re-allocated on each call to nextToken() (this would probably help quite a bit - I'll bet the default size of the backing buffer has to keep growing during dictionary reads).

Another thing that could make a difference is the ordering of the cases and ifs - for example, the default: branch turns around and does a check for (ch == '-' || ch == '+' || ch == '.' || (ch >= '0' && ch <= '9')). Changing this to:

case '-': case '+': case '.': case '0': ... case '9':

may be better.

The loops that check while (ch != -1 && ((ch >= '0' && ch <= '9') || ch == '.')) could probably also be optimized by removing the ch != -1 check - the other conditions already ensure that the loop will exit when ch == -1.

It might be interesting to break the top level parsing branches into separate functions so the profiler can tell us which of these main branches is consuming the bulk of the run time.

Those are the obvious low hanging fruit that I see.

Final point: I've seen some comments suggesting inlining of some code. Modern VMs are quite good at doing this sort of inlining automatically - a test would be advisable before worrying about it too much. Having things split out actually makes it easier to use a profiler to determine where the bottleneck is.

One thing that is quite clear here is that we need some sort of benchmark that we can use for evaluation - for example, if I had a good benchmark test, I would have just tried the ideas above to see how they fared.

- K

Giovanni Azua-2 wrote:
>
> On Apr 22, 2010, at 11:18 PM, trumpetinc wrote:
>>
>> I like your approach! A simple if (ch > 32) return false; at the very
>> top would give the most bang for the least effort (if you do go the
>> bitmask route, be sure to include unit tests!).
>
> Doing this change spares approximately two seconds out of the full
> workload, so it now shows 8s instead of 10s and isWhitespace stays at 1%.
>
> The numbers below include two extra changes: the one from trumpetinc
> above and migrating all StringBuffer references to use StringBuilder
> instead.
>
> The top are now:
>
> PRTokeniser.nextToken          8%  77s   19'268'000 invocations
> RandomAccessFileOrArray.read   6%  53s  149'047'680 invocations
> MappedRandomAccessFile.read    3%  26s   61'065'680 invocations
> PdfReader.removeUnusedCode     1%  15s        6'000 invocations
> PdfEncodings.convertToBytes    1%  15s    5'296'207 invocations
> PRTokeniser.nextValidToken     1%  12s    9'862'000 invocations
> PdfReader.readPRObject         1%  10s    5'974'000 invocations
> ByteBuffer.append(char)        1%  10s   19'379'382 invocations
> PRTokeniser.backOnePosition    1%  10s   17'574'000 invocations
> PRTokeniser.isWhitespace       1%   8s   35'622'000 invocations
>
> A bit further down there is ByteBuffer.append_i, which often needs to
> reallocate and do an array copy - hence the expensive
> ByteBuffer.append(char) above ... I am playing right now with bigger
> initial sizes, e.g. 512 instead of 127 ...
>
> Best regards,
> Giovanni
>
> ------------------------------------------------------------------------------
> _______________________________________________
> iText-questions mailing list
> iText-questions@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/itext-questions
>
> Buy the iText book: http://www.itextpdf.com/book/
> Check the site with examples before you ask questions:
> http://www.1t3xt.info/examples/
> You can also search the keywords list:
> http://1t3xt.info/tutorials/keywords/

--
View this message in context: http://old.nabble.com/performance-follow-up-tp28322800p28344041.html
Sent from the iText - General mailing list archive at Nabble.com.
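Sketching the case-reordering and loop-condition suggestions from earlier in this message (these are illustrative helpers, not the actual PRTokeniser code):

```java
// Hypothetical sketch of the branch-ordering idea: let the switch itself
// dispatch the numeric start characters, instead of falling through to a
// default: branch that re-tests the same character with chained ||'s.
class TokenSketch {
    static boolean looksLikeNumberStart(int ch) {
        switch (ch) {
            case '-': case '+': case '.':
            case '0': case '1': case '2': case '3': case '4':
            case '5': case '6': case '7': case '8': case '9':
                return true;   // compiles to a table/lookup switch
            default:
                return false;
        }
    }

    // The loop-condition point: (ch >= '0' && ch <= '9') || ch == '.' is
    // already false when ch == -1, so the explicit ch != -1 guard in the
    // original loop is redundant and can be dropped.
    static int skipDigits(String s, int pos) {
        int ch = pos < s.length() ? s.charAt(pos) : -1;
        while ((ch >= '0' && ch <= '9') || ch == '.') {
            pos++;
            ch = pos < s.length() ? s.charAt(pos) : -1;
        }
        return pos;
    }
}
```

Whether the switch form actually beats the chained comparisons would need measuring on the real workload - which circles back to the benchmark point above.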