> (Apparently Hadoop, the main reason we shade Guava, does now in fact work
> with later Guavas than its dependency on an old Guava version.)
Andy, Rob: in this light, do you think it's worth trying an experiment to
unshade Guava? From what I understood of a previous conversation about this
question, the only real way to be sure that Jena and Hadoop are playing well
together is actually to make some deployments and try out different
behaviors; there's no way to write a simple integration test…

---
A. Soroka
The University of Virginia Library

> On Oct 19, 2015, at 6:49 AM, Andy Seaborne <[email protected]> wrote:
>
> These comments focus on the architecture of the proposal.
>
> Capturing output
>
> ResponseResultSet/ResultSetFormatter.
>
> See also the discussion point below for a different approach to caching
> that is not based on capturing the written output.
>
> The current design requires changes to ResultSetFormatter and every
> output format.
>
> An alternative is to create a replicating OutputStream which can output
> to two places, one of which can be the string capture. This localises
> changes to Fuseki only.
>
> Sketch (there may be better ways to achieve the same effect):
>
>     class OutputStream2 extends OutputStream {
>         public OutputStream2(OutputStream out1, OutputStream out2) { ... }
>         public void write(byte b[]) throws IOException {
>             if ( out1 != null ) out1.write(b) ;
>             if ( out2 != null ) out2.write(b) ;
>         }
>         ...
>     }
>
> ResponseResultSet.OutputContent can use OutputStream (it does not need
> ServletOutputStream).
>
>     OutputStream outServlet = action.response.getOutputStream() ;
>     OutputStream out ;
>     if ( writing to cache ) {
>         ByteArrayOutputStream outCatcher = new ByteArrayOutputStream() ;
>         out = new OutputStream2(outServlet, outCatcher) ;
>     } else {
>         out = outServlet ;
>     }
>     ...
>     proc.output(out) ;
>
> This will work if a new format is added (a Thrift-based binary format,
> for example) without needing the format to be aware of cache entry
> creation.
>
> It also means the caching is not exposed in the ARQ API.
>
> Caching and content negotiation
>
> The cache key is insensitive to the "Accept" header.
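(Stepping briefly out of the quote: the replicating-stream idea above could be completed along the following lines. `OutputStream2` is the name from Andy's sketch; the exact set of overrides and the demo harness are my own assumptions, not Fuseki code.)

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Replicating OutputStream: every write goes to both wrapped streams. */
class OutputStream2 extends OutputStream {
    private final OutputStream out1;
    private final OutputStream out2;

    public OutputStream2(OutputStream out1, OutputStream out2) {
        this.out1 = out1;
        this.out2 = out2;
    }

    @Override public void write(int b) throws IOException {
        if (out1 != null) out1.write(b);
        if (out2 != null) out2.write(b);
    }

    @Override public void write(byte[] b, int off, int len) throws IOException {
        if (out1 != null) out1.write(b, off, len);
        if (out2 != null) out2.write(b, off, len);
    }

    @Override public void flush() throws IOException {
        if (out1 != null) out1.flush();
        if (out2 != null) out2.flush();
    }

    @Override public void close() throws IOException {
        if (out1 != null) out1.close();
        if (out2 != null) out2.close();
    }
}

public class TeeDemo {
    public static void main(String[] args) throws IOException {
        // Stand-ins for the servlet output stream and the cache capture.
        ByteArrayOutputStream servletSide = new ByteArrayOutputStream();
        ByteArrayOutputStream cacheSide = new ByteArrayOutputStream();
        try (OutputStream out = new OutputStream2(servletSide, cacheSide)) {
            out.write("result bytes".getBytes(StandardCharsets.UTF_8));
        }
        // Both sides received the same bytes.
        System.out.println(servletSide.toString(StandardCharsets.UTF_8.name()));
        System.out.println(cacheSide.toString(StandardCharsets.UTF_8.name()));
    }
}
```

A caveat on the sketch: `close()` here closes both streams, which is fine for a ByteArrayOutputStream capture but a real Fuseki integration would probably want to leave the servlet stream's lifecycle to the container.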
>
> The format of the output is determined by the "Accept" header. The query
> string output= is merely a non-standard way to achieve the same thing
> when it is hard to set the HTTP header (some lightweight scripting
> libraries).
>
> The current design writes the same format as the request, and uses the
> cache, but these two operations are different:
>
>     GET /datasets/query=SELECT * { ?s ?p ?o}
>     Accept: application/sparql-results+xml
>
>     GET /datasets/query=SELECT * { ?s ?p ?o}
>     Accept: application/sparql-results+json
>
> Cache ResultSet
>
> (Discussion point) A possibility is that the cache holds a copy of the
> ResultSet (as Java objects).
>
> Advantages:
>
> • The cached item is not in a particular format; content negotiation
>   happens per request.
> • OFFSET/LIMIT can be applied to the cached results if the original
>   query is executed without OFFSET/LIMIT (a weak version of paging).
>   See the experimental, sketch-only and out-of-date sparql-cache for
>   OFFSET/LIMIT processing.
>
> Disadvantages:
>
> • Does not stream.
> • Cache entries are re-serialized each time they are used.
>
> An iterator over the result set that captures the output while iterating
> would address the non-streaming disadvantage.
>
> Cache invalidation
>
> Update operations must invalidate the cache. A simple approach is to
> invalidate the whole cache; it is very hard to determine selectively
> whether an update affects a particular cache entry.
>
> Configuration and Control
>
> The cache is hard-wired and always on, which may not always be the right
> choice, so there needs to be a way to control it, possibly on a
> per-dataset basis. Note there is only one SPARQL_Query instance per
> Fuseki server due to the way dispatch is dynamic.
>
> Suggestion 1: A servlet SPARQL_Query_Cache catches requests and passes
> them on to a separate SPARQL_Query. This is the inheritance way to
> separate the cache code from the rest of processing.
> It works better if OFFSET/LIMIT control is going to be added later.
>
> Suggestion 2: It is primarily SPARQL_Query::sendResults being caught
> here, so a "result handler" set in SPARQL_Query would allow a separation
> of the cache code into its own class with just a hook in SPARQL_Query.
> This is the composition way to separate the cache code from the rest of
> processing.
>
> Reuse the Guava already in Jena
>
> Extend CacheGuava to have a constructor that takes a
> org.apache.jena.ext.com.google.common.cache.CacheBuilder.
>
> Background: Jena includes a shaded Guava 18 in
> org.apache.jena.ext.com.google.common in the artifact jena-shaded-guava.
> (Apparently Hadoop, the main reason we shade Guava, does now in fact
> work with later Guavas than its dependency on an old Guava version.)
>
> Extend to CONSTRUCT and DESCRIBE (future)
>
> Caching currently covers only SELECT and ASK queries. See ResponseModel
> and ResponseDataset for CONSTRUCT and DESCRIBE. The output capture point
> should make this extension possible.
>
> Documentation and tests
>
> Documentation and tests are needed.
>
> —
> Reply to this email directly or view it on GitHub.
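(One further unquoted aside, tying together the content-negotiation and cache-invalidation points from the email above: a minimal, stdlib-only sketch of a cache whose key includes both the query string and the negotiated content type, and which is cleared wholesale on any update. `QueryCache`, the `Key` record, and all method names here are illustrative assumptions, not Jena's CacheGuava or the Fuseki API; the actual suggestion in the thread is to pass a Guava CacheBuilder into CacheGuava.)

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative sketch only; not Fuseki's actual cache class. */
class QueryCache {
    /** Key = query string + content type, so cache hits respect "Accept". */
    record Key(String queryString, String contentType) {}

    private final Map<Key, byte[]> entries = new ConcurrentHashMap<>();

    byte[] get(String query, String contentType) {
        return entries.get(new Key(query, contentType));
    }

    void put(String query, String contentType, byte[] serializedResults) {
        entries.put(new Key(query, contentType), serializedResults);
    }

    /** Any update invalidates everything: selective invalidation is hard. */
    void invalidateAll() {
        entries.clear();
    }
}

public class CacheDemo {
    public static void main(String[] args) {
        QueryCache cache = new QueryCache();
        String q = "SELECT * { ?s ?p ?o }";
        cache.put(q, "application/sparql-results+xml", new byte[] { 1 });

        // Same query, different Accept header: a distinct cache entry.
        System.out.println(cache.get(q, "application/sparql-results+json") == null);
        System.out.println(cache.get(q, "application/sparql-results+xml") != null);

        // An update clears the whole cache.
        cache.invalidateAll();
        System.out.println(cache.get(q, "application/sparql-results+xml") == null);
    }
}
```

The byte-array values correspond to the "capture the written output" design; under the "cache the ResultSet as Java objects" alternative discussed above, the content type would drop out of the key and serialization would happen per request instead.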
