On 04/07/12 11:38, Andy Seaborne wrote:
On 03/07/12 22:20, Rob Vesse wrote:
Ok, you'll be able to get the file from
https://dl.dropbox.com/u/590790/test.tsv now
Yes, if that code could go into NodeFactory in some form or other, that'd be great.
Rob
Rob,
I've grabbed a local copy - thanks. It's 20x smaller when compressed!
I'll profile the new code to see where the time is going - a quick code
inspection suggests that creating the tokenizer isn't too bad, but a
number of (cheap?) objects are created, and this happens for every item,
so maybe that's just a little too much.
Profiling:
  49% in TokenizerText.hasNext, 46% in TokenizerText.readIRI.
  29% in BufferedReader.readLine (sigh)
  13% in Pattern.split
Much of RIOT goes into avoiding standard Java classes that synchronize
(there's a hidden lock), despite the fact that RIOT is in control and
single-threaded. But here it's only 29%.
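To illustrate what "hidden lock" means here: java.io.BufferedReader synchronizes on an internal lock object for every read() call, even when only one thread ever touches it. A minimal unsynchronized buffered reader - a sketch only, the class and method names are illustrative, not Jena's actual code - looks like:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// A plain buffered reader with no synchronization: safe only because the
// caller guarantees single-threaded use, as a parser's inner loop can.
class UnsyncReader {
    private final Reader in;
    private final char[] buf = new char[8192];
    private int pos = 0, len = 0;

    UnsyncReader(Reader in) { this.in = in; }

    // Single-character read; no hidden lock, unlike BufferedReader.read().
    int read() throws IOException {
        if (pos >= len) {
            len = in.read(buf, 0, buf.length);
            pos = 0;
            if (len < 0)
                return -1;   // end of input
        }
        return buf[pos++];
    }
}

public class UnsyncReaderDemo {
    public static void main(String[] args) throws IOException {
        UnsyncReader r = new UnsyncReader(new StringReader("abc"));
        StringBuilder sb = new StringBuilder();
        int ch;
        while ((ch = r.read()) != -1)
            sb.append((char) ch);
        System.out.println(sb); // prints "abc"
    }
}
```

Uncontended locks are fairly cheap on modern JVMs, which fits the observation that the synchronized reader accounts for 29% rather than dominating.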
TokenizerText.readIRI is the main sink of time.
Conclusion: tokenizer creation is not significant (yippee! it's not new
StringBuilder!)
Several final classes, like PeekIterator, are used directly rather than
through interfaces. They really do matter because they are in the inner
loops, so the kind of method call matters (from previous investigations).
Rewrote readIRI as a switch statement - no change, maybe slightly worse.
No method calls show up below this, so I guess the JIT is inlining all
the calls to same-class private methods.
Conclusion: the time is going on misc single character read/checking.
Not surprising - but it confirms the time is not going where it shouldn't.
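For illustration, the switch-based rewrite tried above might look roughly like this - a hypothetical sketch, not Jena's actual readIRI; the method signature and error handling are invented:

```java
public class ReadIriSketch {

    // Scan an IRI token: 'start' points at the opening '<'; return the
    // characters up to the closing '>', rejecting characters that cannot
    // appear in an IRI reference. Per-character switch dispatch is the
    // point being illustrated.
    static String readIRI(String input, int start) {
        StringBuilder sb = new StringBuilder();
        for (int i = start + 1; i < input.length(); i++) {
            char ch = input.charAt(i);
            switch (ch) {
                case '>':
                    return sb.toString();
                case '<': case '"': case ' ': case '\n': case '\t':
                    throw new IllegalArgumentException("Bad char in IRI: " + ch);
                default:
                    sb.append(ch);
            }
        }
        throw new IllegalArgumentException("Unterminated IRI");
    }

    public static void main(String[] args) {
        System.out.println(readIRI("<http://example/>", 0));
        // prints "http://example/"
    }
}
```

Whether written as a switch or as if/else chains, the work is the same per-character check-and-append, which matches the finding that the rewrite made no difference.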
------
Later ...
I've now centralized the code in NodeFactory (actually, pushed a lot
into Token itself). This seems to add less than 0.5s.
Andy
If we wanted it to go as fast as possible, we could have a version of
TokenizerText that outputs tokens for raw tabs and newlines (not treat
them as suppressible white space), then use a tokenizer on the input data
directly. That's more than just a little tweaking though, so first
understand/improve what we've got.
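A minimal sketch of that idea - emitting explicit TAB and NEWLINE tokens so a TSV parser can consume the token stream directly, instead of skipping them as whitespace. The Kind names and Token class here are hypothetical, not Jena's actual token types:

```java
import java.util.ArrayList;
import java.util.List;

public class TsvTokenizerSketch {
    enum Kind { FIELD, TAB, NEWLINE }

    static final class Token {
        final Kind kind;
        final String image;
        Token(Kind kind, String image) { this.kind = kind; this.image = image; }
    }

    // Emit TAB and NEWLINE as first-class tokens rather than treating
    // them as suppressible whitespace, preserving the TSV structure.
    static List<Token> tokenize(String input) {
        List<Token> out = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            if (ch == '\t' || ch == '\n') {
                if (field.length() > 0) {
                    out.add(new Token(Kind.FIELD, field.toString()));
                    field.setLength(0);
                }
                out.add(new Token(ch == '\t' ? Kind.TAB : Kind.NEWLINE,
                                  String.valueOf(ch)));
            } else {
                field.append(ch);
            }
        }
        if (field.length() > 0)
            out.add(new Token(Kind.FIELD, field.toString()));
        return out;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("<s>\t<p>\t<o>\n"))
            System.out.println(t.kind);
        // FIELD TAB FIELD TAB FIELD NEWLINE, one per line
    }
}
```

In a real version each FIELD would itself be tokenized as an RDF term, but the point is that the row/column structure survives into the token stream, so no line-splitting pass (BufferedReader.readLine, Pattern.split) is needed at all.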