Re: TSVInputIterator

Andy Seaborne Tue, 03 Jul 2012 13:27:33 -0700

On 03/07/12 20:54, Rob Vesse wrote:

Yes I was aware that my original change would potentially allow for bad
inputs.  Your new change looks much more robust and is still pretty fast
(13s total)


I wasn't aware of that tokenizer machinery, looks very useful.

It's the tokenizer for RIOT languages - based on fast Java I/O (nosynchronized!)

 I will
look at using that approach to speed up some other stuff we do internally
with parseNode currently.

If it looks good, I can move the code into NodeFactory (surrounding withor without white space handling) and everywhere benefits.


Ideally we should disallow whitespace but it it makes a major difference
in performance I'd be inclined to let it slide

I very much doubt it. The token string is going to flushed into someCPU cache as it is walked several times so checking first and last charsfor a space should be trivial.

I can send you the results data if you really want but it is 650MB so not
sure if you want to bother


Please (or dump on a site and I'll pull it).

        Andy


Rob



On 7/3/12 10:46 AM, "Andy Seaborne" <[email protected]> wrote:

Rob,

I saw the improvements in TSVInputIterator to go faster - great.  I
think the problem is that NodeFactory.parse node crudely calls SSE for
each string.  SSE is a bit heavy to create each time but the real cost
is, I believe, that IRIs are being validated and IRI checking is not
cheap - I think this is the killer cost.

Your changes avoid the cost of IRI checking but at the expense of
checking so these start to be accepted:

<http://example/
<http://example/white  space>
<<<<http://example/>>>>>
_:abc   def

So I put some tests in :-)

The RIOT parsers puts a cache in front of IRI checking to amorize the
cost.  We could go there with the TSV results parser but how about just
being more lenient with what to accept and not fully checking the IRIs?

Can we test my theory it's IRI checking?  Can you run your speed checks
on the code now in SVN?  I'd also appreciate getting a copy of the test
data (offlist as it's big).

I've put in code to parse using the RDF term tokenizer which checks for
legal tokens but does not fully validate IRIs.  It does some light
checking of IRIs (= it check for spaces).

If this code is acceptable, it could be rolled into NodeFactory.

Other:

1/ I changed all the exceptions to ResultSetException.
    It happens to already be a subclass of Query Exception (no idea why!)

2/ Shall we allow leading and trailing white space?

    I see no harm, other than the fact it's strictly illegal
    but in the spirit of being generous about what to accept, it seems
    in the spirit of TSV files.

        Andy

Re: TSVInputIterator

Reply via email to