[jena-site] branch main updated: Initial documentation of the Service Enhancer plugin (#113)

andy Sat, 20 Aug 2022 04:20:15 -0700

This is an automated email from the ASF dual-hosted git repository.

andy pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/jena-site.git



The following commit(s) were added to refs/heads/main by this push:
     new e59e48c74 Initial documentation of the Service Enhancer plugin (#113)
e59e48c74 is described below

commit e59e48c7478513df6ee3af3afccc2a376b8ffa15
Author: Claus Stadler <[email protected]>
AuthorDate: Sat Aug 20 13:20:03 2022 +0200

    Initial documentation of the Service Enhancer plugin (#113)
---
 source/documentation/query/service_enhancer.md | 540 +++++++++++++++++++++++++
 1 file changed, 540 insertions(+)

diff --git a/source/documentation/query/service_enhancer.md b/source/documentation/query/service_enhancer.md
new file mode 100644
index 000000000..22fc187dd
--- /dev/null
+++ b/source/documentation/query/service_enhancer.md
@@ -0,0 +1,540 @@
+---
+title: Extras - Service Enhancer
+---
+
+# Service Enhancer Plugin
+The service enhancer (SE) plugin extends the functionality of the `SERVICE` clause with:
+
+- Bulk requests
+- Correlated joins, also known as lateral joins
+- A streaming cache for `SERVICE` request results which can also cope with bulk requests and correlated joins. Furthermore, queries that only differ in limit and offset will result in cache hits for overlapping ranges. At present, the plugin only ships with an in-memory caching provider.
+
+As a fundamental principle, a request making use of `cache` and `bulk` should return exactly the same result as if those settings were omitted.
+As a consequence, runtime result set size recognition (RRR) is employed to reveal hidden result set limits and to ensure that only the appropriate amount of data is returned from the caches.
+
+A correlated join using this plugin is syntactically expressed with `SERVICE <loop:> {}`.
+It is a binary operation on two graph patterns:
+The operation "loops" over every binding obtained from evaluation of the left-hand side (lhs) and uses it as an input to substitute the variables of the right-hand side (rhs).
+Afterwards, the substituted rhs is evaluated to a sequence of bindings. Each rhs binding is subsequently merged with the lhs input binding to produce a solution binding of the join.
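The loop semantics described above can be sketched in plain Java. This is a minimal illustration and not the plugin's implementation: bindings are modelled as simple maps from variable name to value, and the class name `CorrelatedJoinSketch` and the `rhs` callback are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Illustrative only: a binding is a plain map from variable name to value. */
final class CorrelatedJoinSketch {

    /**
     * For each lhs binding, evaluates the rhs with that binding as input
     * and merges every rhs result with the lhs input binding.
     */
    static List<Map<String, String>> loopJoin(
            List<Map<String, String>> lhs,
            Function<Map<String, String>, List<Map<String, String>>> rhs) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> input : lhs) {
            // The rhs sees the lhs binding, which is what makes the join correlated
            for (Map<String, String> rhsBinding : rhs.apply(input)) {
                Map<String, String> merged = new LinkedHashMap<>(input);
                merged.putAll(rhsBinding);
                result.add(merged);
            }
        }
        return result;
    }
}
```

A conventional join would instead evaluate the rhs once, without access to the lhs binding, and only then look for compatible pairs.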
+
+## Example
+The following query demonstrates the features of the service enhancer.
+It executes as a single remote request to Wikidata:
+
+```sparql
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX wd: <http://www.wikidata.org/entity/>
+SELECT ?s ?l {
+  # The ids below correspond in order to: Apache Jena, Semantic Web, RDF, SPARQL, Andy Seaborne
+  VALUES ?s { wd:Q1686799 wd:Q54837 wd:Q54872 wd:Q54871 wd:Q108379795 }
+ 
+  SERVICE <cache:loop:bulk+5:https://query.wikidata.org/sparql> {
+    SELECT ?l {
+      ?s rdfs:label ?l
+      FILTER(langMatches(lang(?l), 'en'))
+    } ORDER BY ?l LIMIT 1
+  }
+}
+```
+
+<details>
+  <summary markdown="span">Click here to view the rewritten Query</summary>
+
+```sparql
+SELECT  *
+WHERE
+  {   {   { { SELECT  *
+              WHERE
+                { { SELECT  ?l
+                    WHERE
+                      { <http://www.wikidata.org/entity/Q1686799>
+                                  <http://www.w3.org/2000/01/rdf-schema#label>  ?l
+                        FILTER langMatches(lang(?l), "en")
+                      }
+                  }
+                  BIND(0 AS ?__idx__)
+                }
+              LIMIT   1
+            }
+          }
+        UNION
+          {   { { SELECT  *
+                  WHERE
+                    { { SELECT  ?l
+                        WHERE
+                          { <http://www.wikidata.org/entity/Q54837>
+                                      <http://www.w3.org/2000/01/rdf-schema#label>  ?l
+                            FILTER langMatches(lang(?l), "en")
+                          }
+                      }
+                      BIND(1 AS ?__idx__)
+                    }
+                  LIMIT   1
+                }
+              }
+            UNION
+              {   { { SELECT  *
+                      WHERE
+                        { { SELECT  ?l
+                            WHERE
+                              { <http://www.wikidata.org/entity/Q54872>
+                                          <http://www.w3.org/2000/01/rdf-schema#label>  ?l
+                                FILTER langMatches(lang(?l), "en")
+                              }
+                          }
+                          BIND(2 AS ?__idx__)
+                        }
+                      LIMIT   1
+                    }
+                  }
+                UNION
+                  {   { { SELECT  *
+                          WHERE
+                            { { SELECT  ?l
+                                WHERE
+                                  { <http://www.wikidata.org/entity/Q54871>
+                                              <http://www.w3.org/2000/01/rdf-schema#label>  ?l
+                                    FILTER langMatches(lang(?l), "en")
+                                  }
+                              }
+                              BIND(3 AS ?__idx__)
+                            }
+                          LIMIT   1
+                        }
+                      }
+                    UNION
+                      { { SELECT  *
+                          WHERE
+                            { { SELECT  ?l
+                                WHERE
+                                  { <http://www.wikidata.org/entity/Q108379795>
+                                              <http://www.w3.org/2000/01/rdf-schema#label>  ?l
+                                    FILTER langMatches(lang(?l), "en")
+                                  }
+                              }
+                              BIND(4 AS ?__idx__)
+                            }
+                          LIMIT   1
+                        }
+                      }
+                  }
+              }
+          }
+      }
+    UNION
+      # This union member adds an end marker
+      # Its absence in responses is
+      # used to detect result set size limits
+      { BIND(1000000000 AS ?__idx__) }
+  }
+ORDER BY ASC(?__idx__) ?l
+```
+
+Note that in the query above `?s` has been substituted based on the respective input bindings (in this case the Wikidata IRIs).
+For every bulk query execution, the SE plugin assigns an increasing ID to every input binding (starting from 0). This ID is included in the service request via the `?__idx__` variable. (If the variable is already used then an unused name is allocated by appending a number, such as `?__idx__1`.)
+The `?__idx__` value of every obtained binding determines the input binding it has to be merged with in order to produce the final binding.
+A special value for `?__idx__` is the end marker. It is a number higher than any input binding ID and it is used to detect result set size limits: Its absence in a result set means that the result set was cut off. This information is used to ensure that a request using a certain service IRI does not yield more results than the limit.
+
+</details>
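The role of `?__idx__` and of the end marker in the rewritten query can be sketched as follows. This is a simplified illustration under stated assumptions, not the plugin's code: the class `BulkResponseSketch`, its `Row` type and the string-based merge are hypothetical, and the end marker value simply mirrors the one in the rewritten query.

```java
import java.util.ArrayList;
import java.util.List;

final class BulkResponseSketch {
    /** End marker value as used in the rewritten query; any value above all input ids would do. */
    static final long END_MARKER = 1_000_000_000L;

    /** One row of a bulk response: the ?__idx__ value plus a payload (here a label). */
    static final class Row {
        final long idx;
        final String label;

        Row(long idx, String label) {
            this.idx = idx;
            this.label = label;
        }
    }

    /** True if the end marker row is missing, i.e. the response was cut off by a result size limit. */
    static boolean wasCutOff(List<Row> rows) {
        return rows.stream().noneMatch(r -> r.idx == END_MARKER);
    }

    /** Pairs each non-marker row with the input binding that its idx refers to. */
    static List<String> merge(List<String> inputs, List<Row> rows) {
        List<String> out = new ArrayList<>();
        for (Row r : rows) {
            if (r.idx == END_MARKER) {
                continue; // the marker carries no data
            }
            out.add(inputs.get((int) r.idx) + " -> " + r.label);
        }
        return out;
    }
}
```

Because the response is ordered by `?__idx__`, the marker row sorts last, so a truncated response loses the marker before it loses anything else that would go unnoticed.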
+
+
+Note that a repeated execution of a query (possibly with different limits/offsets) will serve the data from cache rather than making another remote request.
+The cache operates on a per-input-binding basis: For instance, when removing bindings from the `VALUES` block in the example above, the data will still be served from the cache.
+Conversely, adding bindings to the `VALUES` block will only send a (bulk) remote request for those that lack cache entries.
+
+## Namespace
+The plugin introduces the namespace `http://jena.apache.org/service-enhancer#`, which is used for both ARQ context symbols and assembler configuration.
+
+## Maven Dependency
+
+```xml
+<dependency>
+    <groupId>org.apache.jena</groupId>
+    <artifactId>jena-serviceenhancer</artifactId>
+    <version><!-- Check the link below for available versions --></version>
+</dependency>
+```
+[Available Versions](https://mvnrepository.com/artifact/org.apache.jena/jena-serviceenhancer).
+
+Adding this dependency will automatically initialize the plugin via service-loading of `org.apache.jena.sparql.service.enhancer.init.ServiceEnhancerInit` using Jena's plugin system.
+
+## Programmatic Setup
+Loading the `jena-serviceenhancer` jar file automatically enables bulk requests and caching.
+Correlated joins, however, require explicit activation because they require specific algebra transformations to run as part of the query optimization process.
+For more details about the transformation see [Programmatic Algebra Transformation](#programmatic-algebra-transformation).
+
+The following snippet globally enables correlated joins by overriding the context's optimizer:
+```java
+import org.apache.jena.query.ARQ;
+import org.apache.jena.sparql.service.enhancer.init.ServiceEnhancerInit;
+
+ServiceEnhancerInit.wrapOptimizer(ARQ.getContext());
+```
+
+As usual, in order to avoid a global setup, the context of a dataset or statement execution (i.e. query / update) can be used instead:
+```java
+Dataset dataset = DatasetFactory.create();
+ServiceEnhancerInit.wrapOptimizer(dataset.getContext());
+```
+
+The lookup procedure for which optimizer to wrap first consults the given context and then the global one.
+If neither has an optimizer configured then Jena's default one will be used.
+
+Service requests that do not make use of this plugin's options will not be affected even if the plugin is loaded.
+The plugin registration makes use of the [custom service executor extension system](/documentation/query/custom_service_executors.html).
+
+## Assembler
+The `se:DatasetServiceEnhancer` assembler can be used to enable the SE plugin on a dataset.
+This procedure also automatically enables correlated joins using the dataset's context as described in [Programmatic Setup](#programmatic-setup).
+By default, the SE assembler alters the base dataset's context and returns the base dataset again.
+There is one important exception: If `se:enableMgmt` is true then the assembler's final step is to create a wrapped dataset with a copy of the original dataset's context where `enableMgmt` is true.
+This way, management functions are not available in the base dataset.
+
+```ttl
+# assembler.ttl
+PREFIX ja: <http://jena.hpl.hp.com/2005/11/Assembler#>
+PREFIX se: <http://jena.apache.org/service-enhancer#>
+<urn:example:root>
+  a se:DatasetServiceEnhancer ;
+  ja:baseDataset <urn:example:base> ;
+  se:datasetId <https://my.dataset.id/> ; # Defaults to the value of ja:baseDataset
+  se:cacheMaxEntryCount 300 ;             # Maximum number of cache entries;
+                                          # identified by the tuple (service IRI, query, input binding)
+  se:cacheMaxPageCount 15 ;               # Maximum number of pages per cache entry
+  se:cachePageSize 10000 ;                # Number of bindings per page
+  se:enableMgmt false                     # Enables management functions;
+                                          # wraps the base dataset with an independent context
+  .
+
+<urn:example:base> a ja:MemoryDataset .
+```
+
+In the example above, the shown values for `se:cacheMaxEntryCount`, `se:cacheMaxPageCount` and `se:cachePageSize` are the defaults which are used if those options are left unspecified.
+They allow for caching up to 45 million bindings (300 x 15 x 10000).
+There is one caveat though: Specifying the cache options puts a new cache instance in the dataset's context. Without these options, the global cache instance that is registered in the ARQ context by the SE plugin during service loading is used.
+Presently, the global instance cannot be configured via the assembler.
+
+
+Creating a dataset from the specification above is programmatically accomplished as follows:
+```java
+Model spec = RDFDataMgr.loadModel("assembler.ttl");
+Dataset dataset = DatasetFactory.assemble(spec.getResource("urn:example:root"));
+```
+
+The value of `se:datasetId` is used to look up caches when referring to the active dataset using `SERVICE <urn:x-arq:self> {}`.
+
+### Configuration with Fuseki
+
+#### Adding the Service Enhancer JAR
+This section assumes that one of the distributions of `apache-jena-fuseki` has been downloaded from [https://jena.apache.org/download/](https://jena.apache.org/download/).
+The extracted folder should contain the `./fuseki-server` executable start script which automatically loads all jars (relative to `$PWD`) under `run/extra`.
+These folders need to be created, e.g. using `mkdir -p run/extra`. The SE plugin can be manually built or downloaded from Maven Central (it is self-contained without transitive dependencies).
+Placing it into the `run/extra` folder makes it available for use with Fuseki. The plugin and Fuseki version should match.
+
+#### Fuseki Assembler Configuration
+The snippet below shows a simple setup of enabling the SE plugin for a given base dataset.
+Cache management can be performed via SPARQL extension functions. However, usually not every user should be allowed to invalidate caches as this could be exploited for service disruptions.
+Jena does not directly provide a security model for access privileges on functions such as those known from SQL DBMSs. However, with Fuseki it is possible to create both a public and an admin endpoint over the same base dataset:
+
+```ttl
+<#myServicePublic> a fuseki:Service; fuseki:name "test"; fuseki:dataset <#myDsPublic> .
+<#myServiceAdmin>  a fuseki:Service; fuseki:name "testAdmin"; fuseki:dataset <#myDsAdmin> .
+
+<#myDsPublic>      a se:DatasetServiceEnhancer ; ja:baseDataset <#myDsBase> .
+<#myDsAdmin>       a se:DatasetServiceEnhancer ; ja:baseDataset <#myDsBase> ; se:enableMgmt true .
+
+<#myDsBase>        a ja:MemoryDataset .
+```
+
+For configuring access control with Fuseki please refer to [Data Access Control for Fuseki](/documentation/fuseki2/fuseki-data-access-control.html).
+
+## Context Symbols
+The service enhancer plugin defines several symbols for configuration via context.
+The context symbols are in the namespace `http://jena.apache.org/service-enhancer#`.
+
+| Symbol                       | Value type             | Default\* | Description |
+|------------------------------|------------------------|-----------|-------------|
+| `enableMgmt`                 | boolean                | false     | This symbol must be set to true in the context in order to allow calling certain "privileged" SPARQL functions. |
+| `serviceBulkBindingCount`    | int                    | 10        | Number of bindings to group into a single bulk request |
+| `serviceBulkMaxBindingCount` | int                    | 100       | Maximum number of input bindings to group into a single bulk request; restricts `serviceBulkRequestItemCount`. When using `bulk+n` then `n` will be capped to the configured value. |
+| `datasetId`                  | String                 | null      | An IRI to resolve `urn:x-arq:self` to. Used to discriminate cache entries for self-referenced datasets. |
+| `serviceCache`               | ServiceResponseCache   | null      | Symbol for the cache of services' result sets |
+| `serviceResultSizeCache`     | ServiceResultSizeCache | null      | Symbol for the cache of services' result set sizes |
+
+
+\* The value that is assumed if the symbol is absent.
+
+
+The class `org.apache.jena.sparql.service.enhancer.init.ServiceEnhancerConstants` defines the constants for programmatic usage.
+As usual, context attributes can be set on global, dataset and query execution level:
+```java
+// Global level
+ARQ.getContext().set(ServiceEnhancerConstants.serviceBulkBindingCount, 5);
+
+// Dataset level
+Dataset dataset = DatasetFactory.create();
+dataset.getContext().set(ServiceEnhancerConstants.datasetId, "http://example.org/myDatasetId");
+
+// Query Execution level
+try (QueryExecution qe = QueryExecutionFactory.create("SELECT * { ?s ?p ?o }", dataset)) {
+  qe.getContext().set(ServiceEnhancerConstants.enableMgmt, true);
+  // ...
+}
+```
+
+## Service Options
+The service option syntax is used to express a list of key-value pairs followed by an optional IRI.
+The first pair must always be terminated by a `:` in order to avoid misinterpreting it as a relative IRI which would be resolved against the configured base IRI.
+Multiple pairs are separated using `:`. Pairs may be followed by an IRI for the service. If it is absent, then the IRI `urn:x-arq:self` is implicitly assumed.
+
+```
+(key[+value]:)* (key[+value][:] | IRI)
+```
+
+The special IRI `urn:x-arq:self` is used to refer to the active dataset. This is the dataset the query is executed against. If service options are present that are not followed by an IRI then this IRI is assumed.
+Consequently, both `SERVICE <cache:>` and `SERVICE <bulk:loop>` refer to the active dataset.
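To make the option syntax concrete, the following sketch splits a service IRI into its options and the target IRI. It is an assumption-laden illustration rather than the plugin's parser: the class `ServiceOptionSketch` is hypothetical, and recognizing options by a fixed set of keys (`cache`, `loop`, `bulk`, as named in this document) is a simplification chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class ServiceOptionSketch {
    /** Keys this sketch treats as options; assumed from the keys named in this document. */
    static final Set<String> KNOWN_KEYS = Set.of("cache", "loop", "bulk");
    static final String SELF = "urn:x-arq:self";

    /** Options in order of appearance; the value is empty if no "+value" was given. */
    final Map<String, String> options = new LinkedHashMap<>();
    /** Target service IRI; defaults to the active dataset. */
    String iri = SELF;

    static ServiceOptionSketch parse(String serviceIri) {
        ServiceOptionSketch result = new ServiceOptionSketch();
        String rest = serviceIri;
        while (!rest.isEmpty()) {
            int colon = rest.indexOf(':');
            String token = colon < 0 ? rest : rest.substring(0, colon);
            int plus = token.indexOf('+');
            String key = plus < 0 ? token : token.substring(0, plus);
            if (!KNOWN_KEYS.contains(key)) {
                break; // not an option, so the remainder is the service IRI
            }
            result.options.put(key, plus < 0 ? "" : token.substring(plus + 1));
            rest = colon < 0 ? "" : rest.substring(colon + 1);
        }
        if (!rest.isEmpty()) {
            result.iri = rest;
        }
        return result;
    }
}
```

With this sketch, `cache:loop:bulk+5:https://query.wikidata.org/sparql` yields the options `cache`, `loop` and `bulk` (with value `5`) plus the Wikidata endpoint, while `bulk:loop` yields two options and falls back to `urn:x-arq:self`.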
+
+### Bulk Requests
+The `bulk` key enables bulk requests. The default bulk size is based on `serviceBulkBindingCount`. It can be overridden using e.g. `SERVICE <bulk+20:> {...}`. The specified number is silently capped by `serviceBulkMaxBindingCount`.
+
+Execution of a bulk request proceeds by first taking `n` items from the lhs to form a batch.
+Then the bulk query is generated by forming a union where the service's graph pattern is substituted with every input binding in the batch as shown in the [example](#example).
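The capping and batching steps can be sketched as below. This is only an illustration: `BulkBatchSketch` is a hypothetical class, and it partitions an in-memory list, whereas an actual implementation would presumably consume the lhs as a stream.

```java
import java.util.ArrayList;
import java.util.List;

final class BulkBatchSketch {

    /** Caps the requested bulk size (the n of bulk+n) by the configured maximum. */
    static int effectiveBulkSize(int requested, int maxBindingCount) {
        return Math.min(requested, maxBindingCount);
    }

    /** Splits the lhs input bindings into consecutive batches of at most the given size. */
    static <T> List<List<T>> batches(List<T> inputs, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("Batch size must be at least 1");
        }
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < inputs.size(); i += size) {
            int end = Math.min(i + size, inputs.size());
            out.add(new ArrayList<>(inputs.subList(i, end)));
        }
        return out;
    }
}
```

Each batch then corresponds to one remote request whose query is the union of the substituted graph patterns.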
+
+### Correlated Joins
+Informally, conventional joins in SPARQL are bottom-up such that the result of a join is obtained from evaluating the lhs and rhs of a join independently and merging all compatible bindings (and discarding the incompatible ones).
+Correlated joins are left-to-right such that each binding obtained from the evaluation of the lhs is used to substitute the rhs prior to its evaluation.
+Correlated joins alter the scoping rules of variables as demonstrated by the subsequent two examples.
+
+The following concepts are relevant to understand how correlated joins are dealt with:
+* **Scope rename** SPARQL evaluation has a notion of scoping which determines whether a variable will be part of the solution bindings created from a graph pattern [as defined here](https://www.w3.org/TR/sparql11-query/#variableScope). Jena provides `TransformScopeRename` which renames variables such that their names are globally unique. Jena's scope renaming prepends `/` characters before the original variable name so `?x` may become `?/x` or `?//x`. `TransformScopeRename` is applied by the defa [...]
+* **Substitution** When evaluating the lhs of a join, scope renaming ensures that for each obtained binding all variables on the rhs can be substituted with the corresponding values of that binding.
+* **Base name** The base name of a variable is its name without scoping. For example the variables `?x`, `?/x` and `?//x` all have the base name `x`.
+* **Join key** A join key of a join operation is the set of variables that is the intersection of the **visible** variables of the lhs with the **mentioned** variables of the rhs.
+* **Join binding** A join binding is obtained by projecting an lhs input binding onto a join key. It is used to substitute variables on the rhs and is part of the key object used in caching.
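The notions of base name and join key can be illustrated with a few lines of Java. This is a sketch on plain strings, not on Jena's `Var` objects, and the class `JoinKeySketch` is hypothetical; variable names are written without the leading `?`.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class JoinKeySketch {

    /** Strips the leading '/' scope markers, e.g. "//x" becomes "x". */
    static String baseName(String varName) {
        int i = 0;
        while (i < varName.length() && varName.charAt(i) == '/') {
            i++;
        }
        return varName.substring(i);
    }

    /** Intersection of the visible variables of the lhs with the mentioned variables of the rhs. */
    static Set<String> joinKey(Set<String> lhsVisible, Set<String> rhsMentioned) {
        Set<String> key = new LinkedHashSet<>(lhsVisible);
        key.retainAll(rhsMentioned);
        return key;
    }
}
```

With an lhs that exposes `p` and an rhs that mentions `/s`, `/p` and `/o`, the join key is empty because `p` and `/p` are different names; once the rhs mentions `p` itself, the join key becomes `{p}`.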
+
+#### Example of Scoping in a Conventional join
+Consider the following example.
+```sparql
+SELECT ?p ?c {
+  BIND(<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> AS ?p)
+  { SELECT (COUNT(*) AS ?c) { ?s ?p ?o } }
+}
+```
+
+Note that the `?p` on the right-hand side becomes scoped as `?/p`. Consequently, the `?p` of the lhs and the `?/p` of the rhs are considered different variables.
+```
+(project (?p ?c)
+  (join
+    (extend ((?p <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>))
+      (table unit))
+    (project (?c)
+      (extend ((?c ?/.0))
+        (group () ((?/.0 (count)))
+          (bgp (triple ?/s ?/p ?/o))))))) # ?/p is different from the ?p on the lhs
+```
+Because there is no overlap in the variables on either side of the join, the join key is the empty set of variables.
+
+#### Example of Scoping in a Correlated Join
+
+The two effects of the `loop:` transform are shown below. First, a `sequence` is enforced. Second, the scope of `?p` is now the same on the lhs and rhs.
+```sparql
+SELECT ?p ?c {
+  BIND(<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> AS ?p)
+  SERVICE <loop:> { SELECT (COUNT(*) AS ?c) { ?s ?p ?o } }
+}
+```
+
+The obtained algebra now includes `sequence` instead of `join` and the variable `?p` appears on both sides of it:
+```
+(project (?p ?c)
+  (sequence
+    (extend ((?p <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>))
+      (table unit))
+    (service <loop:>
+      (project (?c)
+        (extend ((?c ?/.0))
+          (group () ((?/.0 (count)))
+            (bgp (triple ?/s ?p ?/o)))))))) # ?p is now the same here and on the lhs
+```
+The join key is the set containing `?p` because this variable appears on both sides of the join.
+The lhs will produce a single join binding where `?p` is assigned to `rdf:type`.
+
+Upon evaluation, for each input binding of the lhs the `?p` on the rhs is now substituted, thus giving the count for the specific property.
+Note that the cache system of this plugin caches per join binding even for bulk requests. Hence, use of `SERVICE <loop:cache> {...}` will produce cache hits for repeated join bindings regardless of the pattern on the lhs.
+
+
+#### Programmatic Algebra Transformation
+In order to make `loop:` work the following machinery is in place:
+
+The algebra transformation implemented by `TransformSE_JoinStrategy` needs to run both **before** and **after** the **default** algebra optimization.
+The reason is that it does two things:
+* It converts every `OpJoin` instance with a `loop:` on the right-hand side into an `OpSequence`.
+* Any **mentioned** variable on the rhs whose base name matches the base name of a **visible** variable on the lhs gets substituted by the lhs variable.
+
+```java
+String queryStr = "SELECT ..."; // Put any example query string here
+Transform loopTransform = new TransformSE_JoinStrategy();
+Op op0 = Algebra.compile(QueryFactory.create(queryStr));
+Op op1 = Transformer.transform(loopTransform, op0);
+Op op2 = Optimize.stdOptimizationFactory.create(ARQ.getContext()).rewrite(op1);
+Op op3 = Transformer.transform(loopTransform, op2);
+System.out.println(op3);
+```
+
+### Caching
+Any graph pattern contained in a `SERVICE <cache:> { }` block is subject to caching.
+The key of a cache entry is composed of three components:
+
+* The concrete service IRI
+* The input binding that originates from the lhs
+* The (algebra of) the SERVICE clause's graph pattern (the rhs)
+
+The cache is slice-aware: If the rhs corresponds to a SPARQL query making use of LIMIT and/or OFFSET then the cache lookup will find any previously fetched overlapping ranges
+and derive a backend request that only fetches the needed parts.
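The idea of deriving only the missing parts of a requested slice can be sketched as follows. This is a simplified model and not the plugin's `Slice`-based implementation: the class `SliceGapSketch` and its `Range` type are hypothetical, ranges are half-open row positions, and the cached ranges are assumed to be sorted and non-overlapping.

```java
import java.util.ArrayList;
import java.util.List;

final class SliceGapSketch {

    /** Half-open range [start, end) of result row positions. */
    static final class Range {
        final long start;
        final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Returns the parts of [offset, offset + limit) that are not covered by the cached ranges.
     * Assumes the cached ranges are sorted by start and do not overlap.
     */
    static List<Range> missing(List<Range> cached, long offset, long limit) {
        List<Range> gaps = new ArrayList<>();
        long cursor = offset;
        long end = offset + limit;
        for (Range c : cached) {
            if (c.end <= cursor) {
                continue; // lies entirely before the part still needed
            }
            if (c.start >= end) {
                break; // lies entirely after the request
            }
            if (c.start > cursor) {
                gaps.add(new Range(cursor, c.start));
            }
            cursor = Math.max(cursor, c.end);
            if (cursor >= end) {
                break;
            }
        }
        if (cursor < end) {
            gaps.add(new Range(cursor, end));
        }
        return gaps;
    }
}
```

For example, with rows 0 to 9 cached, a request for `OFFSET 5 LIMIT 10` only needs rows 10 to 14 from the backend in this model.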
+
+The `cache` service option can be used with the following values:
+* `cache`: Read from cache when possible and write retrieved data to cache.
+* `cache+default`: Same as `cache`.
+* `cache+clear`: Clears all cache entries for the current batch of input bindings.
+* `cache+off`: Disables use of the cache in the query execution.
+
+```sparql
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+SELECT * {
+  BIND(rdf:type AS ?p)
+  SERVICE <loop:cache:> {
+    SELECT * {
+      ?s ?p ?o
+    } OFFSET 10 LIMIT 10
+    # ^ Altering limit/offset will match overlapping ranges of data in the cache
+  }
+}
+```
+
+Note that in pathological cases this can require a bulk request to be repeatedly re-executed with disabled caches for each input binding.
+For example, assume that the largest result set seen so far for a service is 1000 and the system is about to serve the 1001st binding from cache for a specific input binding.
+The question is whether this would exceed the service's so far unknown result set size limit. Therefore, in order to answer that question, a remote request that bypasses the cache is performed.
+Furthermore, let's assume that this request produces 2000 results. Then the problem repeats once another input binding's 2001st result is about to be served.
+
+### SPARQL Functions
+The service enhancer plugin introduces functions and property functions for listing cache content and removing cache entries.
+The namespace is
+
+```
+PREFIX se: <http://jena.apache.org/service-enhancer#>
+```
+
+| Signature                | Description |
+|--------------------------|-------------|
+| `long se:cacheRm()`      | Invalidates all entries from the cache that are not currently in use. Returns the number of invalidated entries. |
+| `long se:cacheRm(long)`  | Attempts to remove the given entry. Returns 1 on success or 0 otherwise (e.g. entry did not exist or was still in use). |
+| `?id se:cacheLs ([?serviceIri [?queryStr [?inputBindingStr]]])` | Property function to list cache content. |
+
+```sparql
+PREFIX sepf: <java:org.apache.jena.sparql.service.enhancer.pfunction.>
+SELECT * WHERE {
+  ?id sepf:cacheLs (?service ?query ?binding)
+}
+```
+
+If e.g. data was cached using the following query, then `se:cacheLs` will yield the result set below.
+```sparql
+SELECT * {
+  SERVICE <loop:> {
+    { SERVICE <cache:> {
+      SELECT (<urn:x-arq:DefaultGraph> AS ?g) ?p (COUNT(*) AS ?c) {
+        ?s ?p ?o
+      } GROUP BY ?p
+    } }
+  UNION
+    { SERVICE <cache:> {
+      SELECT ?g ?p (COUNT(*) AS ?c) {
+        GRAPH ?g { ?s ?p ?o }
+      } GROUP BY ?g ?p
+    } }
+  }
+
+  # FILTER(CONTAINS(STR(?g), 'filter over ?g'))
+  # FILTER(CONTAINS(STR(?p), 'filter over ?p'))
+} order by DESC(?c) ?g ?p
+```
+
+```
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+| id | service                           | query                                                                                                 | binding             |
+========================================================================================================================================================================
+| 2  | "urn:x-arq:self@dataset813601419" | "SELECT  (<urn:x-arq:DefaultGraph> AS ?g) ?p (count(*) AS ?c)\nWHERE\n  { ?s  a  ?o }\nGROUP BY ?p\n" | "( ?p = rdf:type )" |
+| 3  | "urn:x-arq:self@dataset813601419" | "SELECT  ?g ?p (count(*) AS ?c)\nWHERE\n  { GRAPH ?g\n      { ?s  a  ?o }\n  }\nGROUP BY ?g ?p\n"     | "( ?p = rdf:type )" |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+```
+
+#### Example: Invalidating all cache entries
+```sparql
+PREFIX se: <http://jena.apache.org/service-enhancer#>
+SELECT (se:cacheRm() AS ?count) { }
+```
+
+#### Example: Invalidating specific cache entries
+```sparql
+PREFIX se: <http://jena.apache.org/service-enhancer#>
+
+SELECT (SUM(se:cacheRm(?id)) AS ?count) {
+  ?id se:cacheLs (<http://dbpedia.org/sparql>)
+}
+```
+
+For completeness, the functions can be addressed via their fully qualified Java class names:
+```
+<java:org.apache.jena.sparql.service.enhancer.pfunction.cacheLs>
+<java:org.apache.jena.sparql.service.enhancer.function.cacheRm>
+```
+
+## Limitations, Troubleshooting and Pitfalls 
+
+### Storing Caches to Disk
+At present the plugin only ships with an in-memory implementation of the cache. Custom storage strategies can be implemented based on the interface `Slice`.
+A file-based storage system is expected to be shipped with a later version of the SE plugin.
+
+### Caching with Virtuoso
+There is a bug in Virtuoso that causes queries making use of DISTINCT with a non-zero OFFSET and without LIMIT to fail.
+The remainder of this section shows how the SE plugin may unexpectedly fail because of it and presents a workaround.
+
+The following query will cause caching of the first 10 results:
+```sparql
+SELECT * { SERVICE <cache:http://dbpedia.org/sparql> { SELECT DISTINCT ?s { ?s a ?o } ORDER BY ?s LIMIT 10 } }
+```
+
+Executing the following query afterwards will fail:
+```sparql
+SELECT * { SERVICE <cache:http://dbpedia.org/sparql> { SELECT DISTINCT ?s { ?s a ?o } ORDER BY ?s } }
+```
+
+The reason is that the first 10 results will be read from cache and the actual query sent as a remote request is:
+```sparql
+SELECT DISTINCT ?s { ?s a ?o } ORDER BY ?s OFFSET 10
+```
+Thus we end up with a query using DISTINCT with a non-zero offset and without LIMIT.
+
+
+As a workaround, note that if the service enhancer plugin detects a result set size limit then it will inject it into remote requests.
+In such cases, executing the query `SELECT * { SERVICE <http://dbpedia.org/sparql> { ?s ?p ?o } }` once will make the result set size limit known
+(at the time of writing DBpedia was configured with a limit of 10000), and therefore the modified request becomes
+
+```sparql
+SELECT DISTINCT ?s { ?s a ?o } ORDER BY ?s OFFSET 10 LIMIT 10000
+```
+
+### Order of Bindings differ between Cache and Remote Reads
+In practice, many triple store engines return the same response for the same graph pattern / query over the same physical database even if ordering is absent.
+As can be seen from the [example](#example), bulk requests result in a union whose results are sorted by the serial numbers assigned to the input bindings.
+However, SPARQL does not mandate stable sorting, therefore this approach may cause bindings with the same serial number to become 'shuffled'.
+The solution is to include sufficient sort conditions in the `SERVICE`'s graph pattern. The bulk query will include those sort conditions after the serial number sort condition.
+
+
