Yes I was aware that my original change would potentially allow for bad
inputs.  Your new change looks much more robust and is still pretty fast
(13s total)

I wasn't aware of that tokenizer machinery, looks very useful.  I will
look at using that approach to speed up some other stuff we do internally
with parseNode currently.

Ideally we should disallow whitespace but it it makes a major difference
in performance I'd be inclined to let it slide

I can send you the results data if you really want but it is 650MB so not
sure if you want to bother

Rob



On 7/3/12 10:46 AM, "Andy Seaborne" <[email protected]> wrote:

>Rob,
>
>I saw the improvements in TSVInputIterator to go faster - great.  I
>think the problem is that NodeFactory.parse node crudely calls SSE for
>each string.  SSE is a bit heavy to create each time but the real cost
>is, I believe, that IRIs are being validated and IRI checking is not
>cheap - I think this is the killer cost.
>
>Your changes avoid the cost of IRI checking but at the expense of
>checking so these start to be accepted:
>
><http://example/
><http://example/white  space>
><<<<http://example/>>>>>
>_:abc   def
>
>So I put some tests in :-)
>
>The RIOT parsers puts a cache in front of IRI checking to amorize the
>cost.  We could go there with the TSV results parser but how about just
>being more lenient with what to accept and not fully checking the IRIs?
>
>Can we test my theory it's IRI checking?  Can you run your speed checks
>on the code now in SVN?  I'd also appreciate getting a copy of the test
>data (offlist as it's big).
>
>I've put in code to parse using the RDF term tokenizer which checks for
>legal tokens but does not fully validate IRIs.  It does some light
>checking of IRIs (= it check for spaces).
>
>If this code is acceptable, it could be rolled into NodeFactory.
>
>Other:
>
>1/ I changed all the exceptions to ResultSetException.
>    It happens to already be a subclass of Query Exception (no idea why!)
>
>2/ Shall we allow leading and trailing white space?
>
>    I see no harm, other than the fact it's strictly illegal
>    but in the spirit of being generous about what to accept, it seems
>    in the spirit of TSV files.
>
>       Andy

Reply via email to