[
https://issues.apache.org/jira/browse/JENA-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15869732#comment-15869732
]
Osma Suominen commented on JENA-329:
------------------------------------
I implemented something like this for the hdtsparql command line tool in the
hdt-java package. See this PR: https://github.com/rdfhdt/hdt-java/pull/43
In the implementation I used a 1000-slot LRU cache to check for duplicates,
effectively a sliding window over the most recently seen triples. In the (very
limited) testing I performed, this seemed to eliminate duplicates effectively
while still performing well. Of course it won't guarantee that all duplicates
are eliminated: a triple that recurs after its earlier copy has been evicted
from the cache is emitted again. Still, I agree with Andy above that this is a
reasonable trade-off. I considered using DistinctDataNet as well, which in my
understanding would eliminate all duplicates, but it would be a lot more costly
in terms of resources (disk space and I/O) for queries with large result sets.
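To illustrate the sliding-window idea, here is a minimal sketch built on
java.util.LinkedHashMap in access order. It is not the code from the hdt-java
PR; the class name LruDistinctFilter and its methods are hypothetical, and the
element type is generic so the example runs without Jena on the classpath.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical helper (not part of Jena or hdt-java): remembers the most
 * recently seen items in a bounded LRU cache and reports whether an item
 * is new with respect to that window.
 */
class LruDistinctFilter<T> {
    private final Map<T, Boolean> window;

    LruDistinctFilter(final int capacity) {
        // accessOrder = true: re-inserting an existing key marks it as most
        // recently used, so frequently repeated items stay in the window.
        this.window = new LinkedHashMap<T, Boolean>(capacity * 2, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<T, Boolean> eldest) {
                // Evict the least recently used entry once the window is full.
                return size() > capacity;
            }
        };
    }

    /** Returns true if the item is not in the window and should be emitted. */
    boolean accept(T item) {
        return window.put(item, Boolean.TRUE) == null;
    }

    /** Convenience method: drains an iterator, keeping only accepted items. */
    List<T> filter(Iterator<T> input) {
        List<T> out = new ArrayList<>();
        while (input.hasNext()) {
            T item = input.next();
            if (accept(item)) {
                out.add(item);
            }
        }
        return out;
    }
}
```

In a streaming setting one would call accept() on each triple as it comes off
the iterator and write it out only when the call returns true, with the
capacity set to 1000 to match the window size described above. Memory use is
bounded by the capacity, which is the point of the trade-off: duplicates
further apart than the window can still slip through.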
I could do the same for tdbquery (and/or the sparql command line tool) if
desired. Probably Fuseki as well, though I'm not very familiar with its
internals.
> Add streaming CONSTRUCT results to Fuseki
> -----------------------------------------
>
> Key: JENA-329
> URL: https://issues.apache.org/jira/browse/JENA-329
> Project: Apache Jena
> Issue Type: Improvement
> Components: Fuseki
> Reporter: Stephen Allen
>
> As a result of JENA-205, streaming results are now available for CONSTRUCT
> queries. However there can be duplicate triples in the iterator. This task
> is to allow Fuseki to stream back results, while at the same time performing
> a distinct operation.
> The fix would be to modify SPARQL_Query to use
> QueryExecution.execConstructTriples() and filter the results through a
> DistinctDataNet<Triple> as they are being streamed back to the client.
> This also requires RDFWriter implementations that can accept Iterator<Triple>
> instead of Model.