On 04/07/12 11:38, Andy Seaborne wrote:
On 03/07/12 22:20, Rob Vesse wrote:
Ok, you'll be able to get the file from
https://dl.dropbox.com/u/590790/test.tsv now

Yes if that code could go into NodeFactory in some form or other
that'd be great

Rob


Rob,

I've grabbed a local copy - thanks.  It's 20x smaller when compressed!

I'll profile the new code to see where the time is going - a quick code
inspection suggests that creating the tokenizer isn't too bad but there
are a number of (cheap?) objects created and this is for every item so
maybe that's just a little too much.

Profiling:

49% in TokenizerText.hasNext, 46% in TokenizerText.readIRI.
29% in BufferedReader.readLine (sigh)
13% in Pattern.split

Much of the work in RIOT goes into avoiding standard Java classes that synchronize internally (there's a hidden lock), even though RIOT is in control and single-threaded. But it's only 29%.
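For illustration only, the 13% in Pattern.split is the kind of cost a hand-rolled splitter could avoid. The following is a minimal sketch, not code from RIOT or NodeFactory; the class name TabSplitSketch and the method splitTabs are hypothetical. It assumes fields are separated by single raw tab characters and that empty fields should be kept.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of RIOT: splits one TSV line on raw tabs
// using indexOf instead of a compiled regular expression.
final class TabSplitSketch {
    private TabSplitSketch() {}

    // Keeps empty fields, including trailing ones. Note that Pattern.split
    // with the default limit drops trailing empty strings, so results can
    // differ for lines that end in a tab.
    static List<String> splitTabs(String line) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = line.indexOf('\t', start)) >= 0) {
            fields.add(line.substring(start, idx));
            start = idx + 1;
        }
        // Whatever follows the last tab (possibly empty) is the final field.
        fields.add(line.substring(start));
        return fields;
    }
}
```

Whether this is actually faster on the test file would need to be measured with the profiler; it only removes the regex machinery, not the per-field substring allocations.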

TokenizerText.readIRI is the main sink of time.

Conclusion: tokenizer creation is not significant (yippee! it's not new StringBuilder!)

Several final classes, like PeekIterator, are used directly rather than through interfaces. That really does matter: they are in the inner loops, so the kind of method call matters (from previous investigations).

Rewrote readIRI as a switch statement - no change, maybe slightly worse.

No method calls below this show up in the profile, so I guess the JIT is inlining all the calls to same-class private methods.

Conclusion: the time is going on miscellaneous single-character reading and checking. Not surprising - but it confirms the time is not going where it shouldn't.
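To make "single-character reading and checking" concrete, here is a rough sketch of what a switch-based IRI scan looks like. This is not the actual TokenizerText.readIRI; the class IriScanSketch, the method readIri, and the set of rejected characters are all assumptions chosen for illustration.

```java
// Hypothetical sketch, not the RIOT implementation: reads the text of an
// IRI written as <...>, checking one character at a time.
final class IriScanSketch {
    private IriScanSketch() {}

    // 'start' must be the index of the opening '<'.
    // Returns the characters between '<' and the closing '>'.
    static String readIri(CharSequence in, int start) {
        if (start >= in.length() || in.charAt(start) != '<')
            throw new IllegalArgumentException("Expected '<' at " + start);
        StringBuilder sb = new StringBuilder();
        for (int i = start + 1; i < in.length(); i++) {
            char ch = in.charAt(i);
            switch (ch) {
                case '>':
                    return sb.toString();
                // Illustrative set of characters treated as illegal here.
                case '<': case ' ': case '\t': case '\n': case '\r':
                    throw new IllegalArgumentException(
                        "Illegal character in IRI at " + i);
                default:
                    sb.append(ch);
            }
        }
        throw new IllegalArgumentException("Unterminated IRI");
    }
}
```

Every character costs one read, one branch and one append, so there is little left to remove once the method calls are inlined.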

------

Later ...

I've now centralized the code in NodeFactory (actually, pushed a lot into Token itself). This seems to add less than 0.5s.

        Andy

If we wanted it to go as fast as possible, we could have a version of
TokenizerText that output tokens for raw tabs and newlines (not treat
them as suppressible white space) then use a tokenizer on the input data
directly.  That's more than just a little tweaking though so first
understand/improve what we've got.
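A toy version of that idea, purely as a sketch: a tokenizer that emits explicit TAB and NEWLINE tokens instead of skipping them, so one pass over the raw input yields both the terms and the row/column structure. None of these names (TsvTokenSketch, Kind, Tok, tokenize) exist in RIOT. It assumes terms never contain a raw tab or newline, and it treats each term as an opaque run of characters rather than parsing it as TokenizerText would.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not RIOT code: tab and newline are tokens in their
// own right rather than suppressible white space.
final class TsvTokenSketch {
    private TsvTokenSketch() {}

    enum Kind { TERM, TAB, NEWLINE }

    static final class Tok {
        final Kind kind;
        final String text;
        Tok(Kind kind, String text) { this.kind = kind; this.text = text; }
    }

    static List<Tok> tokenize(String input) {
        List<Tok> out = new ArrayList<>();
        final int n = input.length();
        int i = 0;
        while (i < n) {
            char ch = input.charAt(i);
            if (ch == '\t') {
                out.add(new Tok(Kind.TAB, "\t"));
                i++;
            } else if (ch == '\n') {
                out.add(new Tok(Kind.NEWLINE, "\n"));
                i++;
            } else if (ch == '\r') {
                i++; // carriage returns are dropped; '\n' marks the row end
            } else {
                // A term is everything up to the next tab or line break.
                int start = i;
                while (i < n && "\t\n\r".indexOf(input.charAt(i)) < 0) i++;
                out.add(new Tok(Kind.TERM, input.substring(start, i)));
            }
        }
        return out;
    }
}
```

A caller would start a new row on each NEWLINE and advance a column on each TAB, which replaces the separate readLine and split steps.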
