Hi, in ARQ there is a QueryIterDistinct which physically implements the OpDistinct (i.e. the logical operator form the SPARQL algebra package).
QueryIterDistinct [1] uses an HashSet<Binding> to keep the already seen bindings and generate a sequence of unique bindings. For 'large' result sets the HashSet can use a 'lot' of RAM. Could we use a bloom filter [2] instead? For each binding, QueryIterDistinct will ask the bloom filter: "have you seen this binding"? If the answer is no, we can be absolutely sure the binding is part of the solution. However, if the answer is yes (i.e. "I've seen this binding already") we cannot be absolutely sure and we will not have a way to check for sure. We would suppress that binding, incurring the risk of not including it in the final solution. (i.e. false positives are possible, although with a low probability). Would the use of a bloom filter (large enough to make the probability of false positives small enough) be acceptable in a QueryIterDistinct implementation? Do you have other ideas on how we could reduce the memory consumption for DISTINCT iterators over large result sets? Another (interesting) opportunity of optimization is when DISTINCT is used with ORDER BY, but I've not looked into it yet. Cheers, Paolo [1] https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/iterator/QueryIterDistinct.java [2] http://en.wikipedia.org/wiki/Bloom_filter
