Github user afs commented on the pull request:
https://github.com/apache/jena/pull/95#issuecomment-213496231
**Intent and Abstraction**
Fuseki-caching isn't going to beat Varnish, so I think the better use of
Fuseki-caching is supporting more flexible cases such as (in the future)
paging results.
I've sketched something in a branch in a work area:
https://github.com/afs/jena/tree/fuseki-cache
This is a sketch and not for serious use - some of the implementation is
quick-and-easy, and it's only lightly tested, on one dataset. No
configurability.
The classes changed are: `HttpAction`, `ResultsCache`, and `SPARQL_Query`
to use the cache. `SPARQL_Query` has operations `processViaCache`,
`prepareForCache`, `insertIntoCache`. It deals with two concurrent attempts to
set the same cache entry by letting them both run (it's the same answer,
right?!) and both set the cache.
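To illustrate the "let them both run" approach, here is a minimal sketch. It is not the code in the branch: `SketchResultsCache` and `getOrCompute` are hypothetical names, and the key and value types are placeholders.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical stand-in for a results cache; not the class in the branch. */
final class SketchResultsCache<K, V> {
    private final ConcurrentHashMap<K, V> entries = new ConcurrentHashMap<>();

    /** Return the cached value, or run the computation and store its result. */
    V getOrCompute(K key, Supplier<V> compute) {
        V cached = entries.get(key);
        if (cached != null) {
            return cached;
        }
        // Deliberately no lock here: two concurrent misses on the same key
        // both run the computation.
        V fresh = compute.get();
        // Last writer wins. That is acceptable only if both callers computed
        // the same answer, i.e. no write happened in between.
        entries.put(key, fresh);
        return fresh;
    }

    int size() {
        return entries.size();
    }
}
```

The trade-off is duplicated work on a concurrent miss in exchange for never holding a lock while a query runs.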
The cache is invalidated when `HttpAction.beginWrite` is called so all update
routes are caught (SPARQL Update, GSP and the Uploader). I don't like that - it
seems asymmetric that `beginWrite` is used and it assumes MR+SW.
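The shape of that invalidation, as a rough sketch only: `CachedDatasetSketch` and its methods are hypothetical and stand in for the real `HttpAction`, whose actual code is in the branch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical model of "every write path goes through beginWrite". */
final class CachedDatasetSketch {
    // Query string -> stored result; both types are placeholders.
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    void cachePut(String query, String result) {
        cache.put(query, result);
    }

    String cacheGet(String query) {
        return cache.get(query); // null on a miss
    }

    /** Reads do not touch the cache state. */
    void beginRead() {
        // nothing to do
    }

    /**
     * Any update route calls this before changing data, so clearing here
     * catches them all. It is only safe if no read overlaps the write
     * (the MR+SW assumption): otherwise a reader that started earlier could
     * insert a stale result after the clear.
     */
    void beginWrite() {
        cache.clear();
    }
}
```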
Cache actions are logged `** Cache`.
**Space**
If the query result (not the serialization) is stored, I would expect the
memory footprint to be less because of sharing nodes with the original
dataset. Any variable bound by graph pattern matching ends up holding the
node by reference. Calculated expressions are fresh nodes. Long literals are
shared.
Literals from the data are not extra cost in memory. Let's assume that
calculated nodes are small. This is usually true - but there may be a lot of
them.
The memory cost is then approximated by the total number of cells in the
results, i.e. approximately "num of rows * num of columns", and it can be
calculated while capturing the `ResultSet` copy. We could put limits on the
size of result sets cached and on the total number of cells.
Serialized results can easily be sized. They do not share space though.
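A minimal sketch of the cell-count estimate, counting while copying. This does not use the Jena API: rows are modelled as plain `List<String>`, and `CellCountingCapture` and the `maxCells` limit are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical capture of a result table that counts cells as it copies. */
final class CellCountingCapture {
    final List<List<String>> rows = new ArrayList<>();
    long cells = 0;          // rows seen so far * number of columns
    boolean overLimit = false;

    /**
     * Copy rows from the source, adding 'columns' cells per row. Once the
     * count passes maxCells the copy is dropped, but the source is still
     * drained so the final count is the full size of the result.
     */
    static CellCountingCapture capture(int columns,
                                       Iterator<List<String>> source,
                                       long maxCells) {
        CellCountingCapture c = new CellCountingCapture();
        while (source.hasNext()) {
            List<String> row = source.next();
            c.cells += columns;
            if (c.cells > maxCells) {
                c.overLimit = true;
                c.rows.clear();   // too big to cache: release what was kept
            }
            if (!c.overLimit) {
                c.rows.add(row);
            }
        }
        return c;
    }
}
```

A limit on the total number of cells across all entries could be tracked the same way, by summing `cells` for the entries that are admitted.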
**Configuration**
We need some configuration control, both server-wide on the `fuseki:Server`
object in config.ttl and on each service. Or use "Context" - caching is
important so my suggestion is to have properties that clearly set values.
The server-wide case is, I think, less important. I suggest putting the
configuration on service, not the dataset, so you can have two different
policies, like cached and not-cached, on the same data.
The default should be "no caching".
Having two services addresses the "cold cache/development" use case.
We should still obey `Pragma: no-cache` and `Cache-control` but there are
quite a lot of options and details so it might be wise to not aim to have
everything for a first release, especially if caching is default off.
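For a first release, the smallest useful subset might be just "skip the cache when the client asks for it". A sketch under that assumption, with hypothetical names (`RequestCachePolicy`, `bypassCache`) and only the `no-cache` directive handled:

```java
import java.util.Locale;

/** Hypothetical helper: does the request ask us not to answer from cache? */
final class RequestCachePolicy {

    /** Header values may be null when the header is absent. */
    static boolean bypassCache(String pragma, String cacheControl) {
        return hasDirective(pragma, "no-cache")
            || hasDirective(cacheControl, "no-cache");
    }

    /** Case-insensitive search of a comma-separated directive list. */
    private static boolean hasDirective(String headerValue, String directive) {
        if (headerValue == null) {
            return false;
        }
        for (String part : headerValue.split(",")) {
            String name = part.trim().toLowerCase(Locale.ROOT);
            int eq = name.indexOf('=');
            if (eq >= 0) {
                // Drop any argument, e.g. "max-age=60" -> "max-age".
                name = name.substring(0, eq).trim();
            }
            if (name.equals(directive)) {
                return true;
            }
        }
        return false;
    }
}
```

Other directives (`max-age`, `no-store`, and so on) are ignored here and would need their own handling.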
**Other**
Related-but-different observation: supporting conditional-GETs would be
very good. Just keep an epoch number/timestamp for each dataset.
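A sketch of how a per-dataset epoch could drive conditional GETs. The names (`DatasetEpoch`, `onWrite`, `statusFor`) are hypothetical, and the `If-None-Match` handling is simplified to an exact match against a single tag.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-dataset change counter used as an HTTP validator. */
final class DatasetEpoch {
    private final AtomicLong epoch = new AtomicLong(0);

    /** Call once for every write to the dataset. */
    void onWrite() {
        epoch.incrementAndGet();
    }

    /** Entity tag derived from the epoch; ETag values are quoted strings. */
    String etag() {
        return "\"" + epoch.get() + "\"";
    }

    /**
     * 304 Not Modified if the client's tag matches the current epoch,
     * otherwise 200 so the query runs and the new tag is sent.
     * Note: a counter starting at 0 repeats after a restart; a real
     * implementation would need to account for that.
     */
    int statusFor(String ifNoneMatch) {
        return etag().equals(ifNoneMatch) ? 304 : 200;
    }
}
```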