On 04/07/12 11:38, Andy Seaborne wrote:
On 03/07/12 22:20, Rob Vesse wrote:
Ok, you'll be able to get the file from
https://dl.dropbox.com/u/590790/test.tsv now
Yes, if that code could go into NodeFactory in some form or other, that'd be great.
Rob
Rob,
I've grabbed a local copy - thanks. It's 20x smaller when compressed!
I'll profile the new code to see where the time is going - a quick code
inspection suggests that creating the tokenizer isn't too bad, but a
number of (cheap?) objects are created, and this happens for every item,
so maybe that's just a little too much.
Profiling:
  49% in TokenizerText.hasNext, 46% in TokenizerText.readIRI.
  29% in BufferedReader.readLine (sigh)
  13% in Pattern.split
Much of RIOT goes into avoiding standard Java classes that synchronize
(there's a hidden lock), despite the fact that RIOT is in control and
single-threaded. But here it's only 29%.
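To illustrate what "hidden lock" means here: java.io.BufferedReader synchronizes on an internal lock object for every read() call, even when only one thread ever touches it. A minimal unsynchronized buffered reader - a sketch only, the class and method names are illustrative, not Jena's actual code - looks like:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// A plain buffered reader with no synchronization: safe only because the
// caller guarantees single-threaded use, as a parser's inner loop can.
class UnsyncReader {
    private final Reader in;
    private final char[] buf = new char[8192];
    private int pos = 0, len = 0;

    UnsyncReader(Reader in) { this.in = in; }

    // Single-character read; no hidden lock, unlike BufferedReader.read().
    int read() throws IOException {
        if (pos >= len) {
            len = in.read(buf, 0, buf.length);
            pos = 0;
            if (len < 0)
                return -1;   // end of input
        }
        return buf[pos++];
    }
}

public class UnsyncReaderDemo {
    public static void main(String[] args) throws IOException {
        UnsyncReader r = new UnsyncReader(new StringReader("abc"));
        StringBuilder sb = new StringBuilder();
        int ch;
        while ((ch = r.read()) != -1)
            sb.append((char) ch);
        System.out.println(sb); // prints "abc"
    }
}
```

Uncontended locks are fairly cheap on modern JVMs, which fits the observation that the synchronized reader accounts for 29% rather than dominating.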
TokenizerText.readIRI is the main sink of time.
Conclusion: tokenizer creation is not significant (yippee! it's not new
StringBuilder!)
Several final classes, like PeekIterator, are used directly rather than
through interfaces. They really do matter because they are in the inner
loops, so the kind of method call matters (from previous investigations).
Rewrote readIRI as a switch statement - no change, maybe slightly worse.
No method calls show up below this, so I guess the JIT is inlining all
the calls to same-class private methods.
Conclusion: the time is going on misc single character read/checking.
Not surprising - but it confirms the time is not going where it shouldn't.
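For illustration, the switch-based rewrite tried above might look roughly like this - a hypothetical sketch, not Jena's actual readIRI; the method signature and error handling are invented:

```java
public class ReadIriSketch {

    // Scan an IRI token: 'start' points at the opening '<'; return the
    // characters up to the closing '>', rejecting characters that cannot
    // appear in an IRI reference. Per-character switch dispatch is the
    // point being illustrated.
    static String readIRI(String input, int start) {
        StringBuilder sb = new StringBuilder();
        for (int i = start + 1; i < input.length(); i++) {
            char ch = input.charAt(i);
            switch (ch) {
                case '>':
                    return sb.toString();
                case '<': case '"': case ' ': case '\n': case '\t':
                    throw new IllegalArgumentException("Bad char in IRI: " + ch);
                default:
                    sb.append(ch);
            }
        }
        throw new IllegalArgumentException("Unterminated IRI");
    }

    public static void main(String[] args) {
        System.out.println(readIRI("<http://example/>", 0));
        // prints "http://example/"
    }
}
```

Whether written as a switch or as if/else chains, the work is the same per-character check-and-append, which matches the finding that the rewrite made no difference.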
------
Later ...
I've now centralized the code in NodeFactory (actually, pushed a lot
into Token itself). This seems to add less than 0.5s.
Andy
If we wanted it to go as fast as possible, we could have a version of
TokenizerText that outputs tokens for raw tabs and newlines (not treat
them as suppressible white space), then use a tokenizer on the input data
directly. That's more than just a little tweaking though, so first
understand/improve what we've got.
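A minimal sketch of that idea - emitting explicit TAB and NEWLINE tokens so a TSV parser can consume the token stream directly, instead of skipping them as whitespace. The Kind names and Token class here are hypothetical, not Jena's actual token types:

```java
import java.util.ArrayList;
import java.util.List;

public class TsvTokenizerSketch {
    enum Kind { FIELD, TAB, NEWLINE }

    static final class Token {
        final Kind kind;
        final String image;
        Token(Kind kind, String image) { this.kind = kind; this.image = image; }
    }

    // Emit TAB and NEWLINE as first-class tokens rather than treating
    // them as suppressible whitespace, preserving the TSV structure.
    static List<Token> tokenize(String input) {
        List<Token> out = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            if (ch == '\t' || ch == '\n') {
                if (field.length() > 0) {
                    out.add(new Token(Kind.FIELD, field.toString()));
                    field.setLength(0);
                }
                out.add(new Token(ch == '\t' ? Kind.TAB : Kind.NEWLINE,
                                  String.valueOf(ch)));
            } else {
                field.append(ch);
            }
        }
        if (field.length() > 0)
            out.add(new Token(Kind.FIELD, field.toString()));
        return out;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("<s>\t<p>\t<o>\n"))
            System.out.println(t.kind);
        // FIELD TAB FIELD TAB FIELD NEWLINE, one per line
    }
}
```

In a real version each FIELD would itself be tokenized as an RDF term, but the point is that the row/column structure survives into the token stream, so no line-splitting pass (BufferedReader.readLine, Pattern.split) is needed at all.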