Hi,

I was about to create a new JIRA issue (an improvement) to 'optimise' SPARQL queries with DISTINCT + ORDER BY + LIMIT. However, while writing it up I convinced myself that it is not really necessary. Here is why.
In JENA-89 we implemented a QueryIterTopN that uses a PriorityQueue to improve the scalability of ORDER BY + LIMIT queries by avoiding a total sort. In JENA-90 we want to reduce the amount of memory used by QueryIterDistinct by replacing an OpDistinct with an OpReduced for DISTINCT + ORDER BY queries, which avoids keeping an in-memory data structure of all the bindings already seen.

What can we do about DISTINCT + ORDER BY + LIMIT queries? We could provide a new QueryIterTopNDistinct which adds a binding to a PriorityQueue if and only if it is not already there. This can be viewed as a further improvement of JENA-89.

However, I am no longer convinced that this is really useful or a good idea, since we want to use QueryIterTopN (i.e. a heap) only for relatively small N in the LIMIT N clause. If N is large, the optimisation described in JENA-90 kicks in and the slicing is cheap. If N is small, JENA-89 kicks in and the DISTINCT over a small number of results is cheap. Therefore we do not need to do anything special for DISTINCT + ORDER BY + LIMIT.

It's better, as Andy suggested, to invest in 'clever' caching and merge joins in TDB. There is no JIRA issue for merge joins in TDB yet.

Paolo
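
P.S. The QueryIterTopNDistinct idea above could be sketched roughly as follows. This is only an illustration over plain Java collections, not ARQ code: TopNDistinctSketch and topN are hypothetical names, the items are generic objects rather than bindings, and distinctness is assumed to mean equals()/hashCode(). It keeps a bounded max-heap of the N best items seen so far, plus a set of what is currently in the heap, so a duplicate is never added twice.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

// Hypothetical sketch, not part of Jena/ARQ.
class TopNDistinctSketch {

    // Returns the n smallest distinct items according to 'order', sorted.
    static <T> List<T> topN(Iterable<T> items, int n, Comparator<T> order) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Max-heap: the head is the worst item among those currently kept.
        PriorityQueue<T> worstFirst =
                new PriorityQueue<>(n, Collections.reverseOrder(order));
        // Mirrors the heap contents so membership checks are O(1).
        Set<T> inHeap = new HashSet<>();

        for (T item : items) {
            if (inHeap.contains(item)) {
                continue; // duplicate of something already kept
            }
            if (worstFirst.size() < n) {
                worstFirst.add(item);
                inHeap.add(item);
            } else if (order.compare(item, worstFirst.peek()) < 0) {
                // Strictly better than the current worst: swap them.
                T evicted = worstFirst.poll();
                inHeap.remove(evicted);
                worstFirst.add(item);
                inHeap.add(item);
            }
            // Otherwise the item cannot be in the top n and is dropped.
        }

        List<T> result = new ArrayList<>(worstFirst);
        result.sort(order);
        return result;
    }
}
```

Memory stays bounded by N for both the heap and the set. An item that was evicted or rejected earlier does not need to be remembered: the worst kept item only ever gets better, so a repeat of such an item fails the comparison again.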
