[GitHub] jena pull request: JENA-626 SPARQL Query Caching

afs Fri, 22 Apr 2016 09:18:12 -0700

Github user afs commented on the pull request:

    https://github.com/apache/jena/pull/95#issuecomment-213496231
  
    **Intent and Abstraction**
    
    Fuseki-caching isn't going to beat Varnish, so I think the better use of 
Fuseki-caching is supporting cases that need flexibility, like (in the future) 
paging results.
    
    I've sketched something in a branch in a work area:
    
    https://github.com/afs/jena/tree/fuseki-cache
    
    This is a sketch and not for serious use - the implementation is 
quick-and-easy in places and it's only lightly tested, against one dataset. No 
configurability.
    
    The classes changed are: `HttpAction`, `ResultsCache`, and `SPARQL_Query` 
to use the cache. `SPARQL_Query` has operations `processViaCache`, 
`prepareForCache`, `insertIntoCache`. It deals with two concurrent attempts to 
set the cache by letting them both run (it's the same answer, right?!) and both 
set the cache.
    
    The cache is invalidated when `HttpAction.beginWrite` is called, so all 
update routes are caught (SPARQL Update, GSP and the Uploader). I don't like 
that - it seems asymmetric that `beginWrite` is used, and it assumes MR+SW 
(multiple readers and a single writer).
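    
    As a rough illustration of the behaviour described above (not the code in the branch), here is a minimal sketch. The class `ResultsCacheSketch`, the `String`/`List<String>` types and the method signatures are assumptions made for the example; only the names `processViaCache` and `beginWrite` come from the description.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical stand-in for a results cache; not the real ResultsCache class.
final class ResultsCacheSketch {
    // Key: the query string. Value: an unmodifiable copy of the result rows.
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();

    // Return the cached answer, or run the query and cache its result.
    // Two threads that miss on the same key both run the query and both put;
    // the last put wins, which is harmless if the answers are the same.
    List<String> processViaCache(String queryString, Supplier<List<String>> execQuery) {
        List<String> cached = cache.get(queryString);
        if (cached != null)
            return cached;
        List<String> fresh =
            Collections.unmodifiableList(new ArrayList<>(execQuery.get()));
        cache.put(queryString, fresh);
        return fresh;
    }

    // Called at the start of any write, whichever update route triggered it.
    void beginWrite() {
        cache.clear();
    }

    int size() {
        return cache.size();
    }
}
```

    Clearing the whole cache on every write is the bluntest possible policy; it is cheap to reason about but throws away entries that the write did not affect.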
    
    Cache actions are logged `** Cache`.
    
    **Space**
    
    If the query result (not the serialization) is stored, I would expect the 
memory footprint will be less because of sharing nodes with the original 
dataset.  Any graph pattern matching variable ends up with the 
node-by-reference.  Calculated expressions are fresh nodes. Long literals are 
shared.
    
    Literals from the data are not an extra cost in memory. Let's assume that 
calculated nodes are small.  This is usually true - but there may be a lot of 
them.
    
    The memory cost is now approximated by the total number of cells in the 
results, i.e. "num of rows * num of columns", and it can be calculated while 
capturing the `ResultSet` copy.  We could put limits on the size of result sets 
cached and on the total number of cells.
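    
    A minimal sketch of counting cells while copying, assuming rows are plain lists of strings rather than Jena bindings. `CellBudgetSketch`, `copyIfSmallEnough` and `maxCells` are hypothetical names for the example, and only the per-result limit is shown, not a cache-wide total.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: copy a result while tracking rows * columns.
final class CellBudgetSketch {
    // Copies the rows, or returns empty as soon as the running cell count
    // would exceed maxCells. The row that would exceed the limit is not read.
    static Optional<List<List<String>>> copyIfSmallEnough(
            int numColumns, Iterator<List<String>> rows, long maxCells) {
        List<List<String>> copy = new ArrayList<>();
        long cells = 0;
        while (rows.hasNext()) {
            cells += numColumns;              // one more row of numColumns cells
            if (cells > maxCells)
                return Optional.empty();      // too big to cache
            copy.add(new ArrayList<>(rows.next()));
        }
        return Optional.of(copy);
    }
}
```

    When the limit is hit the iterator is left partly consumed, so a real implementation would still have to send the already-copied rows and the remainder to the client.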
    
    Serialized results can easily be sized.  They do not share space though.
    
    **Configuration**
    
    We need some configuration control, both server-wide on the `fuseki:Server` 
object in config.ttl and on each service.  Or use "Context" - caching is 
important, so my suggestion is to have properties to clearly set values.  
    
    The server-wide case is, I think, less important. I suggest putting the 
configuration on service, not the dataset, so you can have two different 
policies, like cached and not-cached, on the same data. 
    
    The default should be "no caching".
    
    Having two services addresses the "cold cache/development" use case.
    
    We should still obey `Pragma: no-cache` and `Cache-control` but there are 
quite a lot of options and details so it might be wise to not aim to have 
everything for a first release, especially if caching is default off.
    
    #### Other
    
    Related-but-different observation: supporting conditional GETs would be 
very good.  Just keep an epoch number/timestamp for each dataset.
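    
    One way a per-dataset epoch counter could drive conditional GETs is to expose it as an ETag and compare it with the request's `If-None-Match` header. This is only a sketch: `DatasetEpochSketch` and its methods are hypothetical, it handles a single tag rather than a list or `*`, and a real server would also need to account for content negotiation.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical per-dataset epoch counter used as an HTTP validator.
final class DatasetEpochSketch {
    private final AtomicLong epoch = new AtomicLong(0);

    // Call once for every committed write to the dataset.
    void bump() {
        epoch.incrementAndGet();
    }

    // ETag value for the current state; HTTP requires the quotes.
    String etag() {
        return "\"" + epoch.get() + "\"";
    }

    // 304 Not Modified if the client's tag is current, else 200 OK.
    // A missing header (null) always yields 200.
    int statusFor(String ifNoneMatch) {
        return etag().equals(ifNoneMatch) ? 304 : 200;
    }
}
```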



