Github user afs commented on the pull request:
https://github.com/apache/jena/pull/95#issuecomment-149181341
Review of JENA-626
These comments relate to the architecture of the proposal.
**Capturing output**
(`ResponseResultSet` / `ResultSetFormatter`)
Also see the discussion point below for a different approach to caching
that is not based on capturing the written output.
The current design requires changes to `ResultSetFormatter` and every
output format.
An alternative is to create a replicating `OutputStream` which can output to
two places, one of which can be the string capture. This localises the changes
to Fuseki only.
Sketch (there may be better ways to achieve the same effect).
```
class OutputStream2 extends OutputStream {
    private final OutputStream out1 ;
    private final OutputStream out2 ;
    public OutputStream2(OutputStream out1, OutputStream out2) {
        this.out1 = out1 ;
        this.out2 = out2 ;
    }
    public void write(byte b[]) throws IOException {
        if ( out1 != null ) out1.write(b) ;
        if ( out2 != null ) out2.write(b) ;
    }
    ...
}
```
`ResponseResultSet.OutputContent` can use `OutputStream` (it does not need
`ServletOutputStream`).
```
OutputStream outServlet = action.response.getOutputStream() ;
OutputStream out ;
if ( writing to cache ) {
ByteArrayOutputStream outCatcher = new ByteArrayOutputStream() ;
out = new OutputStream2(outServlet, outCatcher) ;
} else {
out = outServlet ;
}
...
proc.output(out) ;
```
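Putting the two fragments together, a self-contained illustration of the
replicating stream (a `ByteArrayOutputStream` stands in for the servlet
output stream; class and variable names follow the sketch above but are
otherwise illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// A tee-style OutputStream that replicates every write to two destinations.
class OutputStream2 extends OutputStream {
    private final OutputStream out1;
    private final OutputStream out2;

    public OutputStream2(OutputStream out1, OutputStream out2) {
        this.out1 = out1;
        this.out2 = out2;
    }

    @Override
    public void write(int b) throws IOException {
        if (out1 != null) out1.write(b);
        if (out2 != null) out2.write(b);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        if (out1 != null) out1.write(b, off, len);
        if (out2 != null) out2.write(b, off, len);
    }
}

public class TeeDemo {
    public static void main(String[] args) throws IOException {
        // Stand-ins for the servlet output stream and the cache capture.
        ByteArrayOutputStream outServlet = new ByteArrayOutputStream();
        ByteArrayOutputStream outCatcher = new ByteArrayOutputStream();
        OutputStream out = new OutputStream2(outServlet, outCatcher);
        out.write("{\"head\":{}}".getBytes());
        out.flush();
        // Both destinations receive the same bytes.
        System.out.println(outServlet.toString().equals(outCatcher.toString()));
        System.out.println(outCatcher.toString());
    }
}
```

The format writer sees a single ordinary `OutputStream`, so no output format
needs to know that a cache entry is being created.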
This will work if a new format is added (a Thrift based binary format for
example) without needing the format to be aware of cache entry creation.
It also means the caching is not exposed in the ARQ API.
**Caching and content negotiation**
The cache key is insensitive to the "Accept" header.
The format of the output is determined by the "Accept" header. The query
string `output=` is merely a non-standard way to achieve the same thing when it
is hard to set the HTTP header (some lightweight scripting libraries).
The current design caches the output in the format of the request that created
the entry, and replays it from the cache for later requests, but these two
requests are different:
```
GET /dataset?query=SELECT * { ?s ?p ?o }
Accept: application/sparql-results+xml
```
```
GET /dataset?query=SELECT * { ?s ?p ?o }
Accept: application/sparql-results+json
```
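One possible repair, sketched below, is to fold the negotiated media type into
the cache key so that the two requests above hit different entries (the
`cacheKey` helper and its plain string composition are hypothetical, not part
of the patch):

```java
// Sketch: compose the cache key from the query string and the media type
// chosen by content negotiation, so the same query asked with different
// Accept headers maps to different cache entries.
public class CacheKeyDemo {
    static String cacheKey(String queryString, String mediaType) {
        return mediaType + "|" + queryString;
    }

    public static void main(String[] args) {
        String q = "SELECT * { ?s ?p ?o }";
        String k1 = cacheKey(q, "application/sparql-results+xml");
        String k2 = cacheKey(q, "application/sparql-results+json");
        // Different negotiated formats yield different keys.
        System.out.println(!k1.equals(k2));
    }
}
```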
**Cache `ResultSet`**
(Discussion point) A possibility is for the cache to hold a copy of the
`ResultSet` (as Java objects).
Advantages:
* cached item is not in a particular format. Content negotiation happens
per request.
* OFFSET/LIMIT can be applied to the cached results if the original query is
executed without OFFSET/LIMIT (a weak version of paging). See the experimental,
sketch-only and out-of-date
[sparql-cache](https://svn.apache.org/repos/asf/jena/Experimental/sparql-cache/).
Disadvantages:
* Does not stream
* cache entries are rewritten each time they are used.
In fact, an iterator over the result set that captures the output while
iterating would address the non-streaming disadvantage, and content
negotiation would still happen per request.
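A stdlib-only sketch of applying OFFSET/LIMIT to cached results per request
(rows are modelled as plain strings; a real implementation would hold a
rewindable copy of the `ResultSet`, e.g. via `ResultSetFactory.copyResults`):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: cache result rows as plain Java objects; each request can then
// serialise them in its own negotiated format and apply OFFSET/LIMIT
// without re-executing the query.
public class CachedResultsDemo {
    static <T> List<T> slice(List<T> rows, int offset, int limit) {
        int from = Math.min(offset, rows.size());
        int to = Math.min(from + limit, rows.size());
        return new ArrayList<>(rows.subList(from, to));
    }

    public static void main(String[] args) {
        List<String> rows = List.of("row0", "row1", "row2", "row3", "row4");
        // OFFSET 1 LIMIT 2 applied to the cached rows.
        System.out.println(slice(rows, 1, 2));
    }
}
```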
**Cache invalidation**
Update operations must invalidate the cache. A simple approach is to
invalidate the whole cache; determining selectively whether an update affects
a given cache entry is very hard.
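A minimal sketch of whole-cache invalidation (the map-backed cache and
`onUpdate` hook are illustrative, not actual Fuseki code):

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch: invalidate the entire cache on any update. Selective invalidation
// would require analysing which cached queries an update can affect, which
// is very hard in general.
public class InvalidationDemo {
    static final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();

    static void onUpdate() {
        cache.clear();   // simplest safe policy: drop everything
    }

    public static void main(String[] args) {
        cache.put("SELECT * { ?s ?p ?o }", "...cached response...");
        System.out.println(cache.size());
        onUpdate();
        System.out.println(cache.isEmpty());
    }
}
```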
**Configuration and Control**
The cache is hard-wired and always on, which may not always be the right
choice. There needs to be a way to control it, possibly on a per-dataset
basis. Note that there is only one `SPARQL_Query` instance per Fuseki server
because dispatch is dynamic.
Suggestion 1: A servlet `SPARQL_Query_Cache` catches requests and passes them
on to a separate `SPARQL_Query`. This is the inheritance way to separate the
cache code from the rest of the processing. It works better if OFFSET/LIMIT
control is added later.
Suggestion 2: It is primarily `SPARQL_Query::sendResults` being caught here,
so a "result handler" set in `SPARQL_Query` would allow the cache code to be
separated into its own class, with just a hook in `SPARQL_Query`. This is the
composition way to separate the cache code from the rest of the processing.
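A sketch of the composition approach (the `ResultHandler` interface and
`QueryProcessor` class are illustrative stand-ins, not the actual Fuseki
classes):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of Suggestion 2: the query processor gains a pluggable hook, and
// all cache logic lives in the handler, outside the query-processing code.
interface ResultHandler {
    void handle(String queryString, String serialisedResults);
}

class QueryProcessor {
    private final List<ResultHandler> handlers = new ArrayList<>();

    void addHandler(ResultHandler h) { handlers.add(h); }

    void sendResults(String queryString, String results) {
        // ... normal response writing would happen here ...
        for (ResultHandler h : handlers)
            h.handle(queryString, results);
    }
}

public class HookDemo {
    public static void main(String[] args) {
        QueryProcessor proc = new QueryProcessor();
        List<String> cached = new ArrayList<>();
        // The cache is just one handler among possibly many.
        proc.addHandler((q, r) -> cached.add(q + " -> " + r));
        proc.sendResults("ASK {}", "true");
        System.out.println(cached.get(0));
    }
}
```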
**Reuse the Guava already in Jena**
Extend CacheGuava to have a constructor that takes a
`org.apache.jena.ext.com.google.common.cache.CacheBuilder`.
Jena includes a shaded Guava 18 in `org.apache.jena.ext.com.google.common` in
the artifact `jena-shaded-guava`. (In fact, I learnt recently that Hadoop, the
main reason we shade Guava, now works with Guava versions later than its
declared Guava 11 dependency.)
**Extend to CONSTRUCT and DESCRIBE (future)**
It is only for SELECT and ASK queries. See `ResponseModel` and
`ResponseDataset` for CONSTRUCT and DESCRIBE. The output capture point should
make this possible.
**Documentation and tests**
Documentation and tests needed.