Github user afs commented on the pull request:
https://github.com/apache/jena/pull/95#issuecomment-149181341
Review of JENA-626
These comments relate to the architecture of the proposal.
**Capturing output**
(`ResponseResultSet` / `ResultSetFormatter`)
Also see the discussion point below for a different approach to caching
that is not based on capturing the written output.
The current design requires changes to `ResultSetFormatter` and every
output format.
An alternative is to create a replicating `OutputStream` which can output to
two places, one of which can be the string capture. This localises the changes
to Fuseki only.
Sketch (there may be better ways to achieve the same effect).
```
class OutputStream2 extends OutputStream {
    private final OutputStream out1 ;
    private final OutputStream out2 ;
    public OutputStream2(OutputStream out1, OutputStream out2) {
        this.out1 = out1 ;
        this.out2 = out2 ;
    }
    public void write(byte b[]) throws IOException {
        if ( out1 != null ) out1.write(b) ;
        if ( out2 != null ) out2.write(b) ;
    }
    ...
}
```
`ResponseResultSet.OutputContent` can use `OutputStream` (it does not need
`ServletOutputStream`).
```
OutputStream outServlet = action.response.getOutputStream() ;
OutputStream out ;
if ( writing to cache ) {
ByteArrayOutputStream outCatcher = new ByteArrayOutputStream() ;
out = new OutputStream2(outServlet, outCatcher) ;
} else {
out = outServlet ;
}
...
proc.output(out) ;
```
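Putting the two fragments together, a self-contained illustration of the
replicating stream (a `ByteArrayOutputStream` stands in for the servlet
output stream; class and variable names follow the sketch above but are
otherwise illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// A tee-style OutputStream that replicates every write to two destinations.
class OutputStream2 extends OutputStream {
    private final OutputStream out1;
    private final OutputStream out2;

    public OutputStream2(OutputStream out1, OutputStream out2) {
        this.out1 = out1;
        this.out2 = out2;
    }

    @Override
    public void write(int b) throws IOException {
        if (out1 != null) out1.write(b);
        if (out2 != null) out2.write(b);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        if (out1 != null) out1.write(b, off, len);
        if (out2 != null) out2.write(b, off, len);
    }
}

public class TeeDemo {
    public static void main(String[] args) throws IOException {
        // Stand-ins for the servlet output stream and the cache capture.
        ByteArrayOutputStream outServlet = new ByteArrayOutputStream();
        ByteArrayOutputStream outCatcher = new ByteArrayOutputStream();
        OutputStream out = new OutputStream2(outServlet, outCatcher);
        out.write("{\"head\":{}}".getBytes());
        out.flush();
        // Both destinations receive the same bytes.
        System.out.println(outServlet.toString().equals(outCatcher.toString()));
        System.out.println(outCatcher.toString());
    }
}
```

The format writer sees a single ordinary `OutputStream`, so no output format
needs to know that a cache entry is being created.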
This will work if a new format is added (a Thrift based binary format for
example) without needing the format to be aware of cache entry creation.
It also means the caching is not exposed in the ARQ API.
**Caching and content negotiation**
The cache key is insensitive to the "Accept" header.
The format of the output is determined by the "Accept" header. The query
string `output=` is merely a non-standard way to achieve the same thing when it
is hard to set the HTTP header (some lightweight scripting libraries).
The current design caches the output in the format of the request that created
the entry, and replays it from the cache for later requests, but these two
requests are different:
```
GET /dataset?query=SELECT * { ?s ?p ?o }
Accept: application/sparql-results+xml
```
```
GET /dataset?query=SELECT * { ?s ?p ?o }
Accept: application/sparql-results+json
```
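One possible repair, sketched below, is to fold the negotiated media type into
the cache key so that the two requests above hit different entries (the
`cacheKey` helper and its plain string composition are hypothetical, not part
of the patch):

```java
// Sketch: compose the cache key from the query string and the media type
// chosen by content negotiation, so the same query asked with different
// Accept headers maps to different cache entries.
public class CacheKeyDemo {
    static String cacheKey(String queryString, String mediaType) {
        return mediaType + "|" + queryString;
    }

    public static void main(String[] args) {
        String q = "SELECT * { ?s ?p ?o }";
        String k1 = cacheKey(q, "application/sparql-results+xml");
        String k2 = cacheKey(q, "application/sparql-results+json");
        // Different negotiated formats yield different keys.
        System.out.println(!k1.equals(k2));
    }
}
```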
**Cache `ResultSet`**
(Discussion point) A possibility is for the cache to hold a copy of the
`ResultSet` (as Java objects).
Advantages:
* cached item is not in a particular format. Content negotiation happens
per request.
* OFFSET/LIMIT can be applied to the cached results if the original query is
executed without OFFSET/LIMIT (a weak version of paging). See the experimental,
sketch-only and out-of-date
[sparql-cache](https://svn.apache.org/repos/asf/jena/Experimental/sparql-cache/).
Disadvantages:
* Does not stream
* cache entries are rewritten each time they are used.
In fact, an iterator over the result set that captures the output while
iterating would address the non-streaming disadvantage, and content
negotiation would still happen per request.
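A stdlib-only sketch of applying OFFSET/LIMIT to cached results per request
(rows are modelled as plain strings; a real implementation would hold a
rewindable copy of the `ResultSet`, e.g. via `ResultSetFactory.copyResults`):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: cache result rows as plain Java objects; each request can then
// serialise them in its own negotiated format and apply OFFSET/LIMIT
// without re-executing the query.
public class CachedResultsDemo {
    static <T> List<T> slice(List<T> rows, int offset, int limit) {
        int from = Math.min(offset, rows.size());
        int to = Math.min(from + limit, rows.size());
        return new ArrayList<>(rows.subList(from, to));
    }

    public static void main(String[] args) {
        List<String> rows = List.of("row0", "row1", "row2", "row3", "row4");
        // OFFSET 1 LIMIT 2 applied to the cached rows.
        System.out.println(slice(rows, 1, 2));
    }
}
```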
**Cache invalidation**
Update operations must invalidate the cache. A simple approach is to
invalidate the whole cache; determining selectively whether an update affects
a given cache entry is very hard.
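A minimal sketch of whole-cache invalidation (the map-backed cache and
`onUpdate` hook are illustrative, not actual Fuseki code):

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch: invalidate the entire cache on any update. Selective invalidation
// would require analysing which cached queries an update can affect, which
// is very hard in general.
public class InvalidationDemo {
    static final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();

    static void onUpdate() {
        cache.clear();   // simplest safe policy: drop everything
    }

    public static void main(String[] args) {
        cache.put("SELECT * { ?s ?p ?o }", "...cached response...");
        System.out.println(cache.size());
        onUpdate();
        System.out.println(cache.isEmpty());
    }
}
```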
**Configuration and Control**
The cache is hard-wired and always on, which may not always be the right
choice. There needs to be a way to control it, possibly on a per-dataset
basis. Note that there is only one `SPARQL_Query` instance per Fuseki server
because dispatch is dynamic.
Suggestion 1: A servlet `SPARQL_Query_Cache` catches requests and passes them
on to a separate `SPARQL_Query`. This is the inheritance way to separate the
cache code from the rest of the processing. It works better if OFFSET/LIMIT
control is added later.
Suggestion 2: It is primarily `SPARQL_Query::sendResults` being caught here,
so a "result handler" set in `SPARQL_Query` would allow the cache code to be
separated into its own class, with just a hook in `SPARQL_Query`. This is the
composition way to separate the cache code from the rest of the processing.
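A sketch of the composition approach (the `ResultHandler` interface and
`QueryProcessor` class are illustrative stand-ins, not the actual Fuseki
classes):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of Suggestion 2: the query processor gains a pluggable hook, and
// all cache logic lives in the handler, outside the query-processing code.
interface ResultHandler {
    void handle(String queryString, String serialisedResults);
}

class QueryProcessor {
    private final List<ResultHandler> handlers = new ArrayList<>();

    void addHandler(ResultHandler h) { handlers.add(h); }

    void sendResults(String queryString, String results) {
        // ... normal response writing would happen here ...
        for (ResultHandler h : handlers)
            h.handle(queryString, results);
    }
}

public class HookDemo {
    public static void main(String[] args) {
        QueryProcessor proc = new QueryProcessor();
        List<String> cached = new ArrayList<>();
        // The cache is just one handler among possibly many.
        proc.addHandler((q, r) -> cached.add(q + " -> " + r));
        proc.sendResults("ASK {}", "true");
        System.out.println(cached.get(0));
    }
}
```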
**Reuse the Guava already in Jena**
Extend CacheGuava to have a constructor that takes a
`org.apache.jena.ext.com.google.common.cache.CacheBuilder`.
Jena includes a shaded Guava 18 in `org.apache.jena.ext.com.google.common` in
the artifact `jena-shaded-guava`. (In fact, I learnt recently that Hadoop, the
main reason we shade Guava, now works with Guava versions later than its
declared Guava 11 dependency.)
**Extend to CONSTRUCT and DESCRIBE (future)**
It is only for SELECT and ASK queries. See `ResponseModel` and
`ResponseDataset` for CONSTRUCT and DESCRIBE. The output capture point should
make this possible.
**Documentation and tests**
Documentation and tests needed.