Rob,
I saw the improvements to TSVInputIterator to make it faster - great. I
think the problem is that NodeFactory.parseNode crudely calls SSE for
each string. SSE is a bit heavy to create each time, but the real cost
is, I believe, that IRIs are being validated, and IRI checking is not
cheap - I think this is the killer cost.
Your changes avoid the cost of IRI checking, but at the expense of
checking in general, so malformed terms like these start to be accepted:
<http://example/
<http://example/white space>
<<<<http://example/>>>>>
_:abc def
So I put some tests in :-)
The RIOT parsers put a cache in front of IRI checking to amortize the
cost. We could go there with the TSV results parser, but how about just
being more lenient with what to accept and not fully checking the IRIs?
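To make the caching idea concrete, here is a minimal sketch of an LRU cache sitting in front of an expensive check. It is not the RIOT code; the class name CachedIriCheck, the misses() counter and the Predicate standing in for the real IRI checker are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch: an LRU cache in front of an expensive IRI check. */
public class CachedIriCheck {
    // Stand-in for the real (costly) IRI validation; an assumption, not a Jena API.
    private final Predicate<String> expensiveCheck;
    private final Map<String, Boolean> cache;
    private int misses = 0;

    public CachedIriCheck(Predicate<String> expensiveCheck, final int maxEntries) {
        this.expensiveCheck = expensiveCheck;
        // accessOrder=true makes iteration order least-recently-used first,
        // so removeEldestEntry evicts the LRU entry once the cache is full.
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached verdict if present, otherwise runs the expensive check once. */
    public boolean isValid(String iri) {
        Boolean cached = cache.get(iri);
        if (cached != null) {
            return cached;
        }
        misses++;
        boolean ok = expensiveCheck.test(iri);
        cache.put(iri, ok);
        return ok;
    }

    /** Number of times the expensive check actually ran. */
    public int misses() {
        return misses;
    }
}
```

Result sets tend to repeat the same IRIs row after row, so most lookups would hit the cache and the full check would run roughly once per distinct IRI (subject to eviction).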
Can we test my theory it's IRI checking? Can you run your speed checks
on the code now in SVN? I'd also appreciate getting a copy of the test
data (offlist as it's big).
I've put in code to parse using the RDF term tokenizer which checks for
legal tokens but does not fully validate IRIs. It does some light
checking of IRIs (i.e. it checks for spaces).
If this code is acceptable, it could be rolled into NodeFactory.
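As an illustration of what "legal token, lightly checked" might mean, here is a rough sketch. It is not the RDF term tokenizer itself; LenientTermCheck and its two methods are hypothetical names, and the rules are only the ones needed to reject the four malformed examples above.

```java
/** Hypothetical sketch of light, non-validating checks on RDF terms. */
public class LenientTermCheck {

    /** True if s is <...> with no whitespace or stray angle brackets inside. */
    public static boolean looksLikeIri(String s) {
        if (s.length() < 2 || s.charAt(0) != '<' || s.charAt(s.length() - 1) != '>') {
            return false;
        }
        for (int i = 1; i < s.length() - 1; i++) {
            char ch = s.charAt(i);
            if (ch == '<' || ch == '>' || Character.isWhitespace(ch)) {
                return false;
            }
        }
        return true;
    }

    /** True if s is _:label with a non-empty label containing no whitespace. */
    public static boolean looksLikeBlankNode(String s) {
        if (!s.startsWith("_:") || s.length() == 2) {
            return false;
        }
        for (int i = 2; i < s.length(); i++) {
            if (Character.isWhitespace(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

This is a single pass over the characters with no IRI resolution or scheme-specific rules, which is where the saving over full validation would come from.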
Other:
1/ I changed all the exceptions to ResultSetException.
It happens to already be a subclass of QueryException (no idea why!)
2/ Shall we allow leading and trailing white space?
I see no harm, other than the fact that it's strictly illegal,
but being generous about what to accept seems in keeping with
the spirit of TSV files.
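For point 2/, the lenient behaviour could be as small as this sketch, assuming fields are split on tabs before being parsed as terms. TsvFields and splitAndTrim are hypothetical names, not existing code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: split a TSV row and tolerate padding around each field. */
public class TsvFields {

    /** Splits on tab (keeping empty trailing fields) and trims each field. */
    public static List<String> splitAndTrim(String line) {
        List<String> fields = new ArrayList<>();
        // limit -1 keeps trailing empty strings, so unbound trailing columns survive
        for (String raw : line.split("\t", -1)) {
            fields.add(raw.trim());
        }
        return fields;
    }
}
```

White space inside a quoted literal is untouched, because only the outer edges of each field are trimmed.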
Andy