On Tue, May 19, 2009 at 11:44:19AM -0400, Dave Robillard wrote: > > > The problem is that the serializer does not know if the triple stream is > > > sorted. This is solvable easily enough, it just needs doing... > > > > Well, why should it bother if the triple stream is sorted? I've just > > checked the spec one more time and there is no single word about it. > > ... because it has to be sorted to be able to write abbreviated turtle?
But if it is not sorted and serialized as is (probably not optimally) the resulting file will be a valid turtle, right? And the fact is that rapper currently fails to sort input if it is big enough! I think rapper should do it's job only - serializing - and not try to do what 'sort' does good enough and fast. Or maybe there should be an option somewhere for switching this behaviour off... > > > P.S. 'sort' on an ntriples file won't actually give you properly sorted > > > triples, and the problem remains that the serialiser needs to know > > > anyway > > > > Why not? And what do you mean by 'properly sorted'? > > Uh... alphabetical order by text is obviously not equal to the (SPO) > sorted order True. > needed to write abbreviated turtle: I think not. I've attached an archive with six files: * notsorted.ntriples: your original example * notsorted-rapper.turtle: first file converted to turtle by rapper * notsorted-mytool.turtle: first file converted to turtle by my tool * sorted.ntriples: your example sorted by 'sort' * sorted-rapper.turtle: third file converted to turtle by rapper * sorted-mytool.turtle: third file converted to turtle by my tool Note that my tool doesn't sort input at all, just like I proposed. BTW, rapper doesn't sort as well. It just tries to abbreviate everything optimally by size (but not speed). You can see that everything is abbreviated ok. This example only shows that alphabetical sort is ok. I don't have time at the moment to provide a better example with lots of triples but I can do it later if you wish. IMO much of the slowdown in rapper is because rapper puts subject in an array (I checked the code) which has O(n) seek time instead of some binary search tree which will give O(log N) seek time. BTW STL has very fast rbtree implementation. std::set or std::map are based on them. > <http://a> <http://b> <http://c> > <http://ab> <http://b> <http://c> > > The first triple should be the first in increasing sorted order (because > the subject URI is < the second), but textually sorted this is wrong > (because the character '>' is > 'b'). -- Alexander
sorttest.tgz
Description: GNU Unix tar archive
_______________________________________________ redland-dev mailing list [email protected] http://lists.librdf.org/mailman/listinfo/redland-dev
