Ok, you'll be able to get the file from https://dl.dropbox.com/u/590790/test.tsv now
Yes if that code could go into NodeFactory in some form or other that'd be great Rob On 7/3/12 1:26 PM, "Andy Seaborne" <[email protected]> wrote: >On 03/07/12 20:54, Rob Vesse wrote: >> Yes I was aware that my original change would potentially allow for bad >> inputs. Your new change looks much more robust and is still pretty fast >> (13s total) >> >> I wasn't aware of that tokenizer machinery, looks very useful. > >It's the tokenizer for RIOT languages - based on fast Java I/O (no >synchronized!) > >> I will >> look at using that approach to speed up some other stuff we do >>internally >> with parseNode currently. > >If it looks good, I can move the code into NodeFactory (surrounding with >or without white space handling) and everywhere benefits. > >> >> Ideally we should disallow whitespace but it it makes a major difference >> in performance I'd be inclined to let it slide > >I very much doubt it. The token string is going to flushed into some >CPU cache as it is walked several times so checking first and last chars >for a space should be trivial. > >> I can send you the results data if you really want but it is 650MB so >>not >> sure if you want to bother > >Please (or dump on a site and I'll pull it). > > Andy > >> >> Rob >> >> >> >> On 7/3/12 10:46 AM, "Andy Seaborne" <[email protected]> wrote: >> >>> Rob, >>> >>> I saw the improvements in TSVInputIterator to go faster - great. I >>> think the problem is that NodeFactory.parse node crudely calls SSE for >>> each string. SSE is a bit heavy to create each time but the real cost >>> is, I believe, that IRIs are being validated and IRI checking is not >>> cheap - I think this is the killer cost. >>> >>> Your changes avoid the cost of IRI checking but at the expense of >>> checking so these start to be accepted: >>> >>> <http://example/ >>> <http://example/white space> >>> <<<<http://example/>>>>> >>> _:abc def >>> >>> So I put some tests in :-) >>> >>> The RIOT parsers puts a cache in front of IRI checking to amorize the >>> cost. We could go there with the TSV results parser but how about just >>> being more lenient with what to accept and not fully checking the IRIs? >>> >>> Can we test my theory it's IRI checking? Can you run your speed checks >>> on the code now in SVN? I'd also appreciate getting a copy of the test >>> data (offlist as it's big). >>> >>> I've put in code to parse using the RDF term tokenizer which checks for >>> legal tokens but does not fully validate IRIs. It does some light >>> checking of IRIs (= it check for spaces). >>> >>> If this code is acceptable, it could be rolled into NodeFactory. >>> >>> Other: >>> >>> 1/ I changed all the exceptions to ResultSetException. >>> It happens to already be a subclass of Query Exception (no idea >>>why!) >>> >>> 2/ Shall we allow leading and trailing white space? >>> >>> I see no harm, other than the fact it's strictly illegal >>> but in the spirit of being generous about what to accept, it seems >>> in the spirit of TSV files. >>> >>> Andy >> > >
