Re: TSVInputIterator

Rob Vesse Tue, 03 Jul 2012 14:25:05 -0700

Ok, you'll be able to get the file from
https://dl.dropbox.com/u/590790/test.tsv now


Yes if that code could go into NodeFactory in some form or other that'd be
great

Rob


On 7/3/12 1:26 PM, "Andy Seaborne" <[email protected]> wrote:

>On 03/07/12 20:54, Rob Vesse wrote:
>> Yes I was aware that my original change would potentially allow for bad
>> inputs.  Your new change looks much more robust and is still pretty fast
>> (13s total)
>>
>> I wasn't aware of that tokenizer machinery, looks very useful.
>
>It's the tokenizer for RIOT languages - based on fast Java I/O (no
>synchronized!)
>
>> I will look at using that approach to speed up some other stuff we do
>> internally with parseNode currently.
>
>If it looks good, I can move the code into NodeFactory (surrounding with
>or without white space handling) and everywhere benefits.
>
>>
>> Ideally we should disallow whitespace but if it makes a major difference
>> in performance I'd be inclined to let it slide
>
>I very much doubt it.  The token string is going to be flushed into some
>CPU cache as it is walked several times so checking first and last chars
>for a space should be trivial.
>
>> I can send you the results data if you really want but it is 650MB so
>> not sure if you want to bother
>
>Please (or dump on a site and I'll pull it).
>
>       Andy
>
>>
>> Rob
>>
>>
>>
>> On 7/3/12 10:46 AM, "Andy Seaborne" <[email protected]> wrote:
>>
>>> Rob,
>>>
>>> I saw the improvements in TSVInputIterator to go faster - great.  I
>>> think the problem is that NodeFactory.parseNode crudely calls SSE for
>>> each string.  SSE is a bit heavy to create each time but the real cost
>>> is, I believe, that IRIs are being validated and IRI checking is not
>>> cheap - I think this is the killer cost.
>>>
>>> Your changes avoid the cost of IRI checking, but at the expense of
>>> validation, so malformed terms like these start to be accepted:
>>>
>>> <http://example/
>>> <http://example/white  space>
>>> <<<<http://example/>>>>>
>>> _:abc   def
>>>
>>> So I put some tests in :-)
>>>
>>> The RIOT parsers put a cache in front of IRI checking to amortize the
>>> cost.  We could go there with the TSV results parser but how about just
>>> being more lenient with what to accept and not fully checking the IRIs?
>>>
>>> Can we test my theory it's IRI checking?  Can you run your speed checks
>>> on the code now in SVN?  I'd also appreciate getting a copy of the test
>>> data (offlist as it's big).
>>>
>>> I've put in code to parse using the RDF term tokenizer which checks for
>>> legal tokens but does not fully validate IRIs.  It does some light
>>> checking of IRIs (i.e. it checks for spaces).
>>>
>>> If this code is acceptable, it could be rolled into NodeFactory.
>>>
>>> Other:
>>>
>>> 1/ I changed all the exceptions to ResultSetException.
>>>     It happens to already be a subclass of QueryException (no idea
>>>     why!)
>>>
>>> 2/ Shall we allow leading and trailing white space?
>>>
>>>     I see no harm, other than the fact it's strictly illegal,
>>>     but being generous about what to accept seems to be
>>>     in the spirit of TSV files.
>>>
>>>     Andy
>>
>
>
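
For reference, the kind of light checking discussed in this thread (reject
the four malformed terms listed above, optionally tolerate leading and
trailing white space, and skip full IRI validation) can be sketched roughly
as below. This is only an illustration under stated assumptions: the class
LightTermCheck and all of its method names are hypothetical, it is not the
code in SVN and does not use the RIOT tokenizer, and it only looks at IRIs
and blank node labels (literals and variables are ignored).

```java
// Hypothetical sketch: not the Jena/RIOT implementation.
final class LightTermCheck {
    private LightTermCheck() {}

    // Only space and tab count as "outer" white space in a TSV field here.
    private static boolean isBlank(char c) {
        return c == ' ' || c == '\t';
    }

    // Lenient mode: drop leading and trailing spaces/tabs by looking at
    // the ends of the string only.
    static String stripOuterWhitespace(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && isBlank(s.charAt(start))) start++;
        while (end > start && isBlank(s.charAt(end - 1))) end--;
        return s.substring(start, end);
    }

    // Light IRI check: must be <...> with no white space and no stray
    // angle brackets inside. No scheme or syntax validation is done.
    static boolean isPlausibleIri(String token) {
        int n = token.length();
        if (n < 2 || token.charAt(0) != '<' || token.charAt(n - 1) != '>') {
            return false;
        }
        for (int i = 1; i < n - 1; i++) {
            char c = token.charAt(i);
            if (c == '<' || c == '>' || Character.isWhitespace(c)) {
                return false;
            }
        }
        return true;
    }

    // Light blank node check: "_:" followed by a non-empty label that
    // contains no white space.
    static boolean isPlausibleBlankNode(String token) {
        if (token.length() < 3 || !token.startsWith("_:")) {
            return false;
        }
        for (int i = 2; i < token.length(); i++) {
            if (Character.isWhitespace(token.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Entry point: accept a TSV field if it looks like an IRI or a blank
    // node, optionally after stripping outer white space.
    static boolean accepts(String field, boolean allowOuterWhitespace) {
        String token = allowOuterWhitespace ? stripOuterWhitespace(field) : field;
        return isPlausibleIri(token) || isPlausibleBlankNode(token);
    }
}
```

With this sketch, accepts("<http://example/white  space>", true) is false,
while accepts(" <http://example/> ", true) is true and the same field is
rejected when allowOuterWhitespace is false. Each check is a single pass
over the characters of the token.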
