A quick discussion of ETags in the "backup admin" PR sent by Yang Yuanzhe led 
me to this issue:

https://issues.apache.org/jira/browse/JENA-388

for "Make Fuseki responses cacheable", which has been around for a little 
while. I was wondering about a couple of potential approaches here and thought 
I would run them down:

1) ETag-per-Dataset: a single ETag value for a Dataset, covering all requests 
against it and updated whenever a mutating request completes. Any change to a 
Dataset that comes through Fuseki would invalidate all ETag-based caching on 
that Dataset. This seems to be where Andy Seaborne and Rob Vesse were heading, 
but I obviously can't speak for them. Advantage: relatively simple. 
Disadvantages: changes made to the underlying store outside of Fuseki will not 
be reflected properly, and it is only useful for instances that see the right 
patterns of traffic (meaning those for which mutations aren't too "evenly 
sprinkled" amongst queries, which would keep the cache constantly invalidated).
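For concreteness, option 1 might boil down to something like the following 
sketch (plain Java; the class and method names are my own invention, and in 
Fuseki this would hang off the request-handling path rather than stand alone):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: one monotonic version counter per dataset name. */
public class DatasetETags {

    private static final Map<String, AtomicLong> versions = new ConcurrentHashMap<>();

    /** Current ETag for a dataset, e.g. "\"3\"". */
    public static String currentETag(String dataset) {
        return "\"" + versions.computeIfAbsent(dataset, d -> new AtomicLong(0)).get() + "\"";
    }

    /** Called after any mutating request completes; invalidates every cached
     *  response for the dataset, however unrelated to the mutation. */
    public static void bumpVersion(String dataset) {
        versions.computeIfAbsent(dataset, d -> new AtomicLong(0)).incrementAndGet();
    }

    /** True when the client's If-None-Match still matches, so a 304 can be sent. */
    public static boolean notModified(String dataset, String ifNoneMatch) {
        return currentETag(dataset).equals(ifNoneMatch);
    }
}
```

The coarseness is visible right in the sketch: bumpVersion throws away caching 
for every query on the dataset, which is exactly why traffic patterns matter.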

2) Constant Expires: Rob Vesse discusses this a bit in the issue. It's an 
Expires header that is configurable to allow some admin adjustment, but is 
constant during runtime. Advantage: dead simple. Disadvantage: unless the usage 
scenario is very tightly controlled, there's going to be some leakage of stale 
data. That may or may not be a big problem for an integrator, depending on use 
case. It would have to be carefully documented, I think, to avoid nasty 
surprises.
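Option 2 barely needs code, but to make the "configurable at startup, constant 
at runtime" idea concrete, here is a hedged sketch (again with invented names) 
of generating the headers:

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical sketch: an admin-configured, runtime-constant freshness lifetime. */
public class ConstantExpires {

    // Set once from configuration at startup; constant thereafter.
    private final long maxAgeSeconds;

    public ConstantExpires(long maxAgeSeconds) {
        this.maxAgeSeconds = maxAgeSeconds;
    }

    /** Expires header value: now plus the configured lifetime, in the
     *  RFC 1123 date format that HTTP requires. */
    public String expiresHeader(ZonedDateTime now) {
        return DateTimeFormatter.RFC_1123_DATE_TIME
                .format(now.withZoneSameInstant(ZoneOffset.UTC).plusSeconds(maxAgeSeconds));
    }

    /** The Cache-Control equivalent, which HTTP/1.1 caches prefer. */
    public String cacheControlHeader() {
        return "max-age=" + maxAgeSeconds;
    }
}
```

The staleness window is exactly maxAgeSeconds: any mutation landing inside it 
is invisible to caching clients, which is the leakage described above.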

3) Per-query ETag: This would mean some kind of map from request to ETag 
from which ETag headers are supplied for every request. The problem with this 
is that it implies some kind of reasonable algorithm for determining when an 
arbitrary update makes sufficient changes in an arbitrary graph to affect 
another arbitrary query, or it would imply stretching the meaning of "weak" 
ETag to a point that is probably not useful or correct for a query endpoint. 
This doesn't seem very practical.

4) Per-query-for-some-queries ETag. The idea here would be to cut down option 3 
to a tranche of queries for which there actually _does_ exist some reasonable 
algorithm for detecting changes in the query-results. The example that comes to 
mind here would be simple DESCRIBE queries. Since it seems that ARQ deals with 
DESCRIBE using only relationships "outbound" from the things described, this 
approach could use an expiring map from URIs to ETags, updated (perhaps using 
a StatementListener) when a change directly affects a URI or a blank node in 
the CBD of that URI. This could be expensive, but it might be worth it for 
some use cases, for example where integrators are using software like Pubby to 
publish RDF. There might be other examples of query patterns for which changes 
are practically calculable.
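A rough sketch of the per-URI map for option 4 might look like this. I've kept 
it to plain Java (names are my own invention); in Jena proper, invalidate() 
would be driven from a StatementListener registered on the model, firing for 
statements whose subject is the URI or a blank node in its CBD:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: per-URI ETags for simple DESCRIBE queries. */
public class DescribeETags {

    private final Map<String, String> etags = new ConcurrentHashMap<>();

    /** ETag for DESCRIBE <uri>; minted lazily on first request. */
    public String etagFor(String uri) {
        return etags.computeIfAbsent(uri, u -> freshTag());
    }

    /** Invoked from the change listener when a statement in the URI's CBD is
     *  added or removed. The stale tag is replaced, so the next conditional
     *  GET misses the cache and re-runs the DESCRIBE. */
    public void invalidate(String uri) {
        etags.put(uri, freshTag());
    }

    private static String freshTag() {
        return "\"" + UUID.randomUUID() + "\"";
    }
}
```

An expiring map (rather than this unbounded one) would keep memory in check 
for stores with many described resources; that's the "could be expensive" part.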

Whether (and how far) any of these are worth pursuing depends a good bit on the 
use case in hand. For example, for my use cases, option 2 isn't really 
practical, because one of the applications taking results from Fuseki would be 
using them to present live-editing pages. Option 1 would work, and it would 
give some advantage. Option 4 isn't interesting because very few of the queries 
in play will be simple DESCRIBE queries. But that's all based on my use case.

Do you think any of these are worth pursuing? 

---
A. Soroka
The University of Virginia Library
