Osma Suominen commented on JENA-329:

I implemented something like this for the hdtsparql command line tool in the 
hdt-java package. See this PR: https://github.com/rdfhdt/hdt-java/pull/43

In the implementation I used a 1000-slot LRU cache to check for duplicates, 
effectively a sliding window. In the (very limited) testing I performed, this 
seemed to do a good job of eliminating duplicates with good performance. Of 
course it won't guarantee that all duplicates are eliminated, but I agree with 
Andy above that this is a reasonable trade-off. I considered using 
DistinctDataNet as well, which in my understanding would eliminate all 
duplicates, but it would be a lot more costly in terms of resources (disk space 
and IO) for queries with large result sets.
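
For illustration, the sliding-window idea can be sketched in plain Java 
roughly as follows. This is not the code from the PR above: the class name 
LruDistinctFilter and the method firstSeen are hypothetical, and the sketch 
assumes the streamed items (e.g. triples) implement equals() and hashCode() 
consistently. It relies only on java.util.LinkedHashMap in access-order mode.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical helper: an approximate duplicate filter backed by a bounded
 * LRU cache. Duplicates are only detected while the earlier occurrence is
 * still inside the window of recently seen items.
 */
class LruDistinctFilter<T> {
    private final Map<T, Boolean> recent;

    LruDistinctFilter(final int capacity) {
        // accessOrder = true keeps entries in least-recently-used order,
        // so the eldest entry is always the one touched longest ago.
        this.recent = new LinkedHashMap<T, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<T, Boolean> eldest) {
                // Evict once the window grows beyond the configured size.
                return size() > capacity;
            }
        };
    }

    /**
     * Records the item and returns true if it was not in the window,
     * i.e. the caller should emit it. Returns false for a recent duplicate.
     */
    boolean firstSeen(T item) {
        // put() returns null when the key was absent; on an existing key it
        // also counts as an access and refreshes the entry's LRU position.
        return recent.put(item, Boolean.TRUE) == null;
    }
}
```

A caller would wrap the result iterator and forward only the items for which 
firstSeen() returns true, constructing the filter with a capacity of 1000 to 
match the window size described above. Memory use stays bounded by the 
capacity, at the cost of letting through any duplicate whose earlier 
occurrence has already been evicted.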

I could do the same for tdbquery (and/or the sparql command line tool) if 
desired. Probably Fuseki as well, though I'm not very familiar with its 

> Add streaming CONSTRUCT results to Fuseki
> -----------------------------------------
>                 Key: JENA-329
>                 URL: https://issues.apache.org/jira/browse/JENA-329
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: Fuseki
>            Reporter: Stephen Allen
>
> As a result of JENA-205, streaming results are now available for CONSTRUCT
> queries.  However there can be duplicate triples in the iterator.  This task
> is to allow Fuseki to stream back results, while at the same time performing
> a distinct operation.
>
> The fix would be to modify SPARQL_Query to use
> QueryExecution.execConstructTriples() and filter the results through a
> DistinctDataNet<Triple> as they are being streamed back to the client.
>
> This also requires RDFWriter implementations that can accept Iterator<Triple>
> instead of Model.
