Re: SPARQL queries with DISTINCT

Simon Helsen Mon, 08 Aug 2011 06:36:35 -0700

On this topic, I'd like to point out that we have a separate outside 
mechanism for paging which behaves like you suggest, however, the 
difference with OFFSET/LIMIT is that the next time someone makes the query 
e.g. to obtain the next page, we do expect the query to be recalculated 
since the state of the store may have changed.


So, if you plan to change the behavior by introducing a caching model, you 
may actually alter the behavior unless you are able to determine that a 
subsequent execution of a query would not have changed results (e.g. by 
having the actions isolated in a transaction?)

Simon



From:
Andy Seaborne <[email protected]>
To:
[email protected]
Date:
08/06/2011 01:27 PM
Subject:
Re: SPARQL queries with DISTINCT





On 05/08/11 17:50, Stephen Allen wrote:
> Hi Paulo,
>
> A probabilistic DISTINCT operator could be useful for some cases I can 
think
> of.  But you'd probably want to make it an extension to SPARQL since a 
lot
> (most?) of the time you need correct results.  Unfortunately I don't 
think
> bloom filters would help even with REDUCED queries since we pretty much 
need
> the opposite behavior, false negatives being OK but no false positives.
>
> In terms of DISTINCT implementations, the two choices are a hashset or
> sorting.  Both of those cases could be rewritten with spill-to-disk 
backing
> stores if the set got too large.  In this respect it seems to be similar 
to
> JENA-44 and JENA-45.  A hashset approach provides lower latency to the 
first
> result, but the sorting approach could be faster in total execution time
> (especially when you have to start spilling to disk).  In fact, a number 
of
> RDBMSs offer two options for constructing query plans, one minimizes 
total
> execution time and the other minimizes the time to retrieve the first
> result.

Good point - the optimizer could see if a query is is DISTINCT-ORDER BY 
and, if so, use OpReduce instead of OpDistinct.

REDUCED in ARQ is remove adjacent duplicates.

> I agree it would be cool if the query operators were aware of sorted 
inputs
> so that DISTINCT could be implemented cheaply if there were an 
underlying
> ORDER BY (or even if the triple/quad index provided the bindings in the
> correct sort order).  In this same vein it also opens up possibilities 
for
> cheap merge joins instead of nested loop joins.

The other useful thing to spot is ORDER BY-OFFSET-LIMIT; as this 
calculates the whole results, they can be cached and then sliced. 
Because this is "paging", it is reasonable to expect a query soon with 
the same setup but OFFSET+=pagesize


On merge joins -

TDB indexes do return predictably ordered results from a index access. 
Merge joins can be done and would be especially useful to apply to the 
first two triple patterns (later ones depend on whether the result of 
the first two come up in a useful order).

it's not so much the ORDER BY but the fact that TDB range indexes happen 
to do the right thing.

                 Andy

>
> -Stephen
>
>
> -----Original Message-----
> From: Paolo Castagna [mailto:[email protected]]
> Sent: Friday, August 05, 2011 11:59 AM
> To: [email protected]
> Subject: SPARQL queries with DISTINCT
>
> Hi,
> in ARQ there is a QueryIterDistinct which physically implements the
> OpDistinct (i.e. the logical operator form the SPARQL algebra package).
>
> QueryIterDistinct [1] uses an HashSet<Binding>  to keep the already seen
> bindings and generate a sequence of unique bindings.
>
> For 'large' result sets the HashSet can use a 'lot' of RAM.
>
> Could we use a bloom filter [2] instead?
>
> For each binding, QueryIterDistinct will ask the bloom filter: "have
> you seen this binding"? If the answer is no, we can be absolutely sure
> the binding is part of the solution. However, if the answer is yes
> (i.e. "I've seen this binding already") we cannot be absolutely sure
> and we will not have a way to check for sure. We would suppress that
> binding, incurring the risk of not including it in the final solution.
> (i.e. false positives are possible, although with a low probability).
>
> Would the use of a bloom filter (large enough to make the probability
> of false positives small enough) be acceptable in a QueryIterDistinct
> implementation?
>
> Do you have other ideas on how we could reduce the memory consumption
> for DISTINCT iterators over large result sets?
>
> Another (interesting) opportunity of optimization is when DISTINCT is
> used with ORDER BY, but I've not looked into it yet.
>
> Cheers,
> Paolo
>
>   [1]
> 
https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/h

> pl/jena/sparql/engine/iterator/QueryIterDistinct.java
>   [2] http://en.wikipedia.org/wiki/Bloom_filter
>

Re: SPARQL queries with DISTINCT

Reply via email to