Yes I was aware that my original change would potentially allow for bad inputs. Your new change looks much more robust and is still pretty fast (13s total)
I wasn't aware of that tokenizer machinery, looks very useful. I will look at using that approach to speed up some other stuff we do internally with parseNode currently. Ideally we should disallow whitespace but it it makes a major difference in performance I'd be inclined to let it slide I can send you the results data if you really want but it is 650MB so not sure if you want to bother Rob On 7/3/12 10:46 AM, "Andy Seaborne" <[email protected]> wrote: >Rob, > >I saw the improvements in TSVInputIterator to go faster - great. I >think the problem is that NodeFactory.parse node crudely calls SSE for >each string. SSE is a bit heavy to create each time but the real cost >is, I believe, that IRIs are being validated and IRI checking is not >cheap - I think this is the killer cost. > >Your changes avoid the cost of IRI checking but at the expense of >checking so these start to be accepted: > ><http://example/ ><http://example/white space> ><<<<http://example/>>>>> >_:abc def > >So I put some tests in :-) > >The RIOT parsers puts a cache in front of IRI checking to amorize the >cost. We could go there with the TSV results parser but how about just >being more lenient with what to accept and not fully checking the IRIs? > >Can we test my theory it's IRI checking? Can you run your speed checks >on the code now in SVN? I'd also appreciate getting a copy of the test >data (offlist as it's big). > >I've put in code to parse using the RDF term tokenizer which checks for >legal tokens but does not fully validate IRIs. It does some light >checking of IRIs (= it check for spaces). > >If this code is acceptable, it could be rolled into NodeFactory. > >Other: > >1/ I changed all the exceptions to ResultSetException. > It happens to already be a subclass of Query Exception (no idea why!) > >2/ Shall we allow leading and trailing white space? > > I see no harm, other than the fact it's strictly illegal > but in the spirit of being generous about what to accept, it seems > in the spirit of TSV files. > > Andy
