TSVInputIterator

Andy Seaborne Tue, 03 Jul 2012 10:47:25 -0700

Rob,

I saw the improvements in TSVInputIterator to go faster - great. I think the problem is that NodeFactory.parseNode crudely calls SSE for each string. SSE is a bit heavy to create each time but the real cost is, I believe, that IRIs are being validated and IRI checking is not cheap - I think this is the killer cost.

Your changes avoid the cost of IRI checking but at the expense of checking, so these start to be accepted:


<http://example/
<http://example/white  space>
<<<<http://example/>>>>>
_:abc   def

So I put some tests in :-)

The RIOT parsers put a cache in front of IRI checking to amortize the cost. We could go there with the TSV results parser, but how about just being more lenient with what to accept and not fully checking the IRIs?
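To make the caching idea concrete, here is a minimal sketch of a small LRU cache placed in front of an expensive check. It is not the RIOT code: the class name CachedIriCheck, the miss counter and the Predicate standing in for the real IRI checker are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch: an LRU cache in front of an expensive IRI check. */
class CachedIriCheck {
    private final Predicate<String> expensiveCheck; // stand-in for the real checker
    private final Map<String, Boolean> cache;
    private int misses = 0;

    CachedIriCheck(Predicate<String> expensiveCheck, final int maxEntries) {
        this.expensiveCheck = expensiveCheck;
        // accessOrder = true: iteration order is least recently used first,
        // so removeEldestEntry evicts the least recently used string.
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached verdict if present, else runs the expensive check once. */
    boolean isValid(String iri) {
        Boolean known = cache.get(iri);
        if (known != null)
            return known;
        misses++;
        boolean ok = expensiveCheck.test(iri);
        cache.put(iri, ok);
        return ok;
    }

    /** Number of times the expensive check actually ran. */
    int misses() {
        return misses;
    }
}
```

The cost is only amortized when the same IRIs recur, which is typical of result sets where a handful of predicates and types appear on most rows.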

Can we test my theory it's IRI checking? Can you run your speed checks on the code now in SVN? I'd also appreciate getting a copy of the test data (offlist as it's big).

I've put in code to parse using the RDF term tokenizer, which checks for legal tokens but does not fully validate IRIs. It does some light checking of IRIs (= it checks for spaces).


If this code is acceptable, it could be rolled into NodeFactory.
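
For illustration only (this is not the code in SVN, and it does not use the RDF term tokenizer), the kind of light checking described above might look like the sketch below. The class LenientTermCheck and its methods are hypothetical names. It handles only IRI and blank node fields, and trimming the field is an assumption that anticipates the white space question in point 2 below.

```java
/** Hypothetical sketch of light, non-validating checks on one TSV field. */
final class LenientTermCheck {
    private LenientTermCheck() {}

    /** True if the field has the shape of an IRI or blank node token. */
    static boolean isAcceptable(String field) {
        // Assumption: leading and trailing white space is tolerated.
        String s = field.trim();
        if (s.startsWith("<"))
            return looksLikeIri(s);
        if (s.startsWith("_:"))
            return looksLikeBlankNode(s);
        return false; // literals and other terms are out of scope for this sketch
    }

    /** Requires <...> with no white space or angle brackets inside; no IRI validation. */
    private static boolean looksLikeIri(String s) {
        if (s.length() < 2 || !s.endsWith(">"))
            return false;
        for (int i = 1; i < s.length() - 1; i++) {
            char c = s.charAt(i);
            if (c == '<' || c == '>' || Character.isWhitespace(c))
                return false;
        }
        return true;
    }

    /** Requires _: followed by a non-empty label with no white space. */
    private static boolean looksLikeBlankNode(String s) {
        if (s.length() < 3)
            return false;
        for (int i = 2; i < s.length(); i++) {
            if (Character.isWhitespace(s.charAt(i)))
                return false;
        }
        return true;
    }
}
```

With these rules, all four bad inputs listed earlier are rejected, while `<http://example/>` and `_:abc` are accepted without any full IRI validation.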

Other:

1/ I changed all the exceptions to ResultSetException.
   It happens to already be a subclass of QueryException (no idea why!)

2/ Shall we allow leading and trailing white space?

   I see no harm, other than the fact it's strictly illegal,
   but being generous about what to accept seems
   in the spirit of TSV files.

        Andy
