This tells us that the focus needs to be on PRTokeniser and RAFOA.  For what
it's worth, these have also shown up as the bottlenecks in profiling I've
done in the parser package (not too surprising).

I'll discuss each in turn:

RandomAccessFileOrArray - I've done a lot of thinking on this one in the
past.  This is an interesting class that can have massive impact on
performance, depending on how you use it.  For example, if you load your
source PDF entirely into memory, and pass the bytes into RAFOA, it will
remove IO bottlenecks.  

Giovanni - if your source PDFs are small enough, you might want to try this,
just to get a feel for the impact that IO blocking is having on your results
(read entire PDF into byte[] and use PdfReader(byte[]))
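
A minimal sketch of that experiment, assuming Java 7+ and a file that fits comfortably on the heap. Only the JDK part is runnable as written; the hand-off to PdfReader(byte[]) is left as a comment because it needs iText on the classpath. The class name InMemoryPdfLoad and the command-line argument are placeholders, not anything from iText.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class InMemoryPdfLoad {

    // Reads the entire file in one go so that later parsing never touches the disk.
    static byte[] loadFully(Path pdf) throws IOException {
        return Files.readAllBytes(pdf);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: InMemoryPdfLoad <file.pdf>");
            return;
        }
        long start = System.nanoTime();
        byte[] bytes = loadFully(Paths.get(args[0]));
        long loadMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("loaded " + bytes.length + " bytes in " + loadMs + " ms");

        // With iText on the classpath, the bytes would then be handed over:
        //   PdfReader reader = new PdfReader(bytes);
        // and the usual workload timed from this point on.
    }
}
```

Comparing the workload time with and without this pre-load should give a rough idea of how much of the total is IO wait.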


The next thing that I looked at was buffering (a naive use of
RandomAccessFile is horrible for performance, and significant gains can be
had by implementing a paging strategy).  I actually implemented a paging
RandomAccessFile and started work on rolling it into RAFOA last year, but my
benchmarks showed that the memory mapped strategy that RAFOA uses had
equivalent performance to the paging RAF implementation.

These tests weren't conclusive, so there may still be some things to learn
in this area.

The one problem with the memory mapped strategy (in its current
implementation) is that really big source PDFs still can't be loaded into
memory.  This could be addressed by using a paging strategy on the mapped
portion of the file - probably keep 10 or 15 mapped regions in an MRU cache
(maybe 1MB in size each).
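
A rough sketch of what such a paged mapping could look like, using only the JDK. This is not RAFOA code: PagedMappedFile, its constructor parameters and mappedPageCount() are made up for illustration. The cache keeps the most recently used regions and drops the least recently used one once the limit is reached.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.LinkedHashMap;
import java.util.Map;

final class PagedMappedFile implements AutoCloseable {
    private final RandomAccessFile file;
    private final FileChannel channel;
    private final long length;
    private final int pageSize;
    private final Map<Long, MappedByteBuffer> pages;

    PagedMappedFile(String path, int pageSize, final int maxPages) throws IOException {
        this.file = new RandomAccessFile(path, "r");
        this.channel = file.getChannel();
        this.length = channel.size();
        this.pageSize = pageSize;
        // accessOrder=true: the eldest entry is always the least recently used page.
        this.pages = new LinkedHashMap<Long, MappedByteBuffer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, MappedByteBuffer> eldest) {
                return size() > maxPages;
            }
        };
    }

    // Returns the byte at the given position as 0..255, or -1 outside the file.
    int read(long position) throws IOException {
        if (position < 0 || position >= length) {
            return -1;
        }
        long pageIndex = position / pageSize;
        long pageStart = pageIndex * pageSize;
        MappedByteBuffer page = pages.get(pageIndex);
        if (page == null) {
            // Map only this region; the last page may be shorter than pageSize.
            long size = Math.min(pageSize, length - pageStart);
            page = channel.map(FileChannel.MapMode.READ_ONLY, pageStart, size);
            pages.put(pageIndex, page);
        }
        return page.get((int) (position - pageStart)) & 0xff;
    }

    // Number of regions currently held in the cache.
    int mappedPageCount() {
        return pages.size();
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

With the numbers suggested above this would be something like new PagedMappedFile(path, 1 << 20, 12). One caveat: an evicted MappedByteBuffer is only unmapped when it is garbage collected, so the real address-space footprint can lag behind the cache size.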

For reference, the ugly (really ugly) hack that determines whether RAFOA
will use memory mapped IO is the Document.plainRandomAccess static public
variable (shudder).




So what about the code paths in PRTokeniser.nextToken()?

We've got a number of tight loops reading individual characters from the
RAFOA.  If the backing source isn't buffered, this would be a problem, but I
don't know that this is really the issue here (it would be worth measuring
though...)

The StringBuffer could definitely be replaced with a StringBuilder, and it
could be re-used instead of re-allocating for each call to nextToken()
(this would probably help quite a bit, as I'll bet the default size of the
backing buffer has to keep growing during dictionary reads).
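
To illustrate the re-use idea, here is a small stand-in (not the PRTokeniser code): ReusableTokenBuffer and readWord() are hypothetical, and the initial capacity of 128 is an arbitrary guess. The point is only that setLength(0) clears the content while keeping the already-grown backing array.

```java
final class ReusableTokenBuffer {
    // One builder per tokeniser instance; not thread-safe, just like StringBuilder.
    private final StringBuilder buffer = new StringBuilder(128);

    // Collects characters from start up to (not including) the next space.
    String readWord(String input, int start) {
        buffer.setLength(0); // drops the old content, keeps the allocated capacity
        for (int i = start; i < input.length() && input.charAt(i) != ' '; i++) {
            buffer.append(input.charAt(i));
        }
        return buffer.toString();
    }

    // Exposed only so the effect of re-use can be observed.
    int capacity() {
        return buffer.capacity();
    }
}
```

Whether this actually helps would need measuring; toString() still copies the characters on every call.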

Another thing that could make a difference is the ordering of the cases and
ifs - for example, the default: branch turns around and does a check for
(ch == '-' || ch == '+' || ch == '.' || (ch >= '0' && ch <= '9')).  Changing
this to be:

case '-':
case '+':
case '.':
case '0':
...
case '9':

May be better.
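
Spelled out in full, the idea looks roughly like the sketch below. NumberStart and startsNumber() are illustrative names, not part of PRTokeniser, and whether the switch form is actually faster is something only a measurement can answer.

```java
final class NumberStart {

    // True if ch can begin a numeric token. Every character gets its own
    // label, so one switch dispatch replaces the chain of comparisons.
    static boolean startsNumber(int ch) {
        switch (ch) {
            case '-':
            case '+':
            case '.':
            case '0':
            case '1':
            case '2':
            case '3':
            case '4':
            case '5':
            case '6':
            case '7':
            case '8':
            case '9':
                return true;
            default:
                return false;
        }
    }
}
```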


The loops that check for while (ch != -1 && ((ch >= '0' && ch <= '9') || ch
== '.')) could also probably be simplified by removing the ch != -1 check -
the other conditions already fail when ch == -1, so the loop exits either way.
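
A quick way to convince yourself the two conditions are equivalent is to compare them over the whole range of values a read() can return. EofCheck and its method names are made up for this sketch.

```java
final class EofCheck {

    // The loop condition as it currently stands.
    static boolean withEofTest(int ch) {
        return ch != -1 && ((ch >= '0' && ch <= '9') || ch == '.');
    }

    // The same condition with the explicit end-of-file test removed.
    static boolean withoutEofTest(int ch) {
        return (ch >= '0' && ch <= '9') || ch == '.';
    }

    // True if both forms give the same answer for every value in [from, to].
    static boolean agreeOnRange(int from, int to) {
        for (int ch = from; ch <= to; ch++) {
            if (withEofTest(ch) != withoutEofTest(ch)) {
                return false;
            }
        }
        return true;
    }
}
```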


It might be interesting to break the top level parsing branches into
separate functions so the profiler can tell us which of these main branches
is consuming the bulk of the run time.


Those are the obvious low hanging fruit that I see.

Final point:  I've seen some comments suggesting inlining of some code. 
Modern VMs are quite good at doing this sort of inlining automatically - a
test would be advisable before worrying about it too much.  Having things
split out actually makes it easier to use a profiler to determine where the
bottleneck is.


One thing that is quite clear here is that we need to have some sort of
benchmark that we can use for evaluation - for example, if I had a good
benchmark test, I would have just tried the ideas above to see how they
fared.
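
As a starting point, even something as crude as the harness below would do: a few untimed warm-up runs so the JIT has a chance to compile the hot paths, then the median of several timed runs. MiniBenchmark and medianNanos() are hypothetical names, the dummy workload in main() is only a placeholder for "parse a fixed set of PDFs", and a dedicated benchmarking framework would be more rigorous.

```java
import java.util.Arrays;

final class MiniBenchmark {

    // Runs the task `warmups` times untimed, then returns the median
    // wall-clock time in nanoseconds over `runs` timed executions.
    static long medianNanos(Runnable task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }

    public static void main(String[] args) {
        // Placeholder workload; the real one would run the parser over sample files.
        final long[] sink = {0};
        long ns = medianNanos(() -> {
            for (int i = 0; i < 1_000_000; i++) {
                sink[0] += i % 7;
            }
        }, 5, 11);
        System.out.println("median: " + (ns / 1000) + " us (sink=" + sink[0] + ")");
    }
}
```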

- K


Giovanni Azua-2 wrote:
> 
> 
> On Apr 22, 2010, at 11:18 PM, trumpetinc wrote:
>> 
>> I like your approach!  A simple if (ch > 32) return false; at the very top
>> would give the most bang for the least effort (if you do go the bitmask
>> route, be sure to include unit tests!).
> 
> 
> Doing this change spares approximately two seconds out of the full
> workload so now shows 8s instead of 10s and isWhitespace stays at 1%.
> 
> The numbers below include two extra changes: the one from trumpetinc above
> and migrating all StringBuffer references to use instead StringBuilder.
> 
> The top are now:
> 
> Method                          Share  Time  Invocations
> PRTokeniser.nextToken             8%   77s    19'268'000
> RandomAccessFileOrArray.read      6%   53s   149'047'680
> MappedRandomAccessFile.read       3%   26s    61'065'680
> PdfReader.removeUnusedCode        1%   15s          6000
> PdfEncodings.convertToBytes       1%   15s     5'296'207
> PRTokeniser.nextValidToken        1%   12s     9'862'000
> PdfReader.readPRObject            1%   10s     5'974'000
> ByteBuffer.append(char)           1%   10s    19'379'382
> PRTokeniser.backOnePosition       1%   10s    17'574'000
> PRTokeniser.isWhitespace          1%    8s    35'622'000
> 
> A bit further down there is ByteBuffer.append_i that often needs to
> reallocate and do an array copy thus the expensive ByteBuffer.append(char)
> above ... I am playing right now with bigger initial sizes e.g. 512
> instead of 127 ...    
> 
> Best regards,
> Giovanni
> ------------------------------------------------------------------------------
> _______________________________________________
> iText-questions mailing list
> iText-questions@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/itext-questions
> 
> Buy the iText book: http://www.itextpdf.com/book/
> Check the site with examples before you ask questions:
> http://www.1t3xt.info/examples/
> You can also search the keywords list:
> http://1t3xt.info/tutorials/keywords/
> 
> 

-- 
View this message in context: 
http://old.nabble.com/performance-follow-up-tp28322800p28344041.html
Sent from the iText - General mailing list archive at Nabble.com.

