Rob,

I saw the improvements in TSVInputIterator to go faster - great. I think the problem is that NodeFactory.parse node crudely calls SSE for each string. SSE is a bit heavy to create each time but the real cost is, I believe, that IRIs are being validated and IRI checking is not cheap - I think this is the killer cost.

Your changes avoid the cost of IRI checking but at the expense of checking so these start to be accepted:

<http://example/
<http://example/white  space>
<<<<http://example/>>>>>
_:abc   def

So I put some tests in :-)

The RIOT parsers puts a cache in front of IRI checking to amorize the cost. We could go there with the TSV results parser but how about just being more lenient with what to accept and not fully checking the IRIs?

Can we test my theory it's IRI checking? Can you run your speed checks on the code now in SVN? I'd also appreciate getting a copy of the test data (offlist as it's big).

I've put in code to parse using the RDF term tokenizer which checks for legal tokens but does not fully validate IRIs. It does some light checking of IRIs (= it check for spaces).

If this code is acceptable, it could be rolled into NodeFactory.

Other:

1/ I changed all the exceptions to ResultSetException.
   It happens to already be a subclass of Query Exception (no idea why!)

2/ Shall we allow leading and trailing white space?

   I see no harm, other than the fact it's strictly illegal
   but in the spirit of being generous about what to accept, it seems
   in the spirit of TSV files.

        Andy

Reply via email to