A quick discussion of ETags in the "backup admin" PR that was sent by Yang Yuanzhe led me to this issue:
https://issues.apache.org/jira/browse/JENA-388 ("Make Fuseki responses cacheable"), which has been around for a little while. I was wondering about a few potential approaches here and thought I would run them down:

1) ETag-per-Dataset: a single ETag value per Dataset, used for all requests and updated whenever a mutating request completes. Any change whatsoever to a Dataset that comes through Fuseki would invalidate all ETag-based caching on that Dataset. This seems to be where Andy Seaborne and Rob Vesse were heading, but I obviously can't speak for them. Advantage: relatively simple. Disadvantages: changes to the indexes not performed by Fuseki will not be reflected properly, and it is only useful for instances that receive the right patterns of changes (meaning those for which mutations aren't too "evenly sprinkled" amongst queries, which would keep the cache invalidated most of the time).

2) Constant Expires: Rob Vesse discusses this a bit in the issue. It's an Expires header that is configurable to allow some admin adjustment, but is constant during runtime. Advantage: dead simple. Disadvantage: unless the usage scenario is very tightly controlled, there's going to be some leakage of stale data. That may or may not be a big problem for an integrator, depending on use case. It would have to be carefully documented, I think, to avoid nasty surprises.

3) Per-query ETag: This would mean some kind of map from request to ETag, from which ETag headers are supplied for every request. The problem is that it implies some kind of reasonable algorithm for determining when an arbitrary update changes an arbitrary graph enough to affect another arbitrary query. Failing that, it would imply stretching the meaning of "weak" ETag to a point that is probably not useful or correct for a query endpoint. This doesn't seem very practical.

4) Per-query-for-some-queries ETag.
The idea here would be to cut option 3 down to a tranche of queries for which there actually _does_ exist some reasonable algorithm for detecting changes in the query results. The example that comes to mind is simple DESCRIBE queries. Since it seems that ARQ deals with DESCRIBE using only relationships "outbound" from the things described, this approach could use an expiring map from URIs to ETags, which could be updated (perhaps using a StatementListener) when a change directly affects a URI or a blank node in the CBD of that URI. This could be expensive, but it might be worth it for some use cases, for example where integrators are using software like Pubby to publish RDF. There might be other examples of query patterns where changes are practically calculable.

Whether (and how far) any of these are worth pursuing depends a good bit on the use case in hand. For example, for my use cases, option 2 isn't really practical, because one of the applications taking results from Fuseki would be using them to present live-editing pages. Option 1 would work, and it would give some advantage. Option 4 isn't interesting because very few of the queries in play will be simple DESCRIBE queries. But that's all based on my use case.

Do you think any of these are worth pursuing?

---
A. Soroka
The University of Virginia Library
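
A rough sketch of option 1, for concreteness. This is not Fuseki code: the class DatasetETags and its methods are hypothetical names, and it assumes every mutating request passes through one place that can call recordUpdate(). It keeps a version counter per dataset name, derives the ETag from it, and compares a single If-None-Match value against the current ETag to decide whether a 304 could be sent.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper (not part of Fuseki): one version counter per dataset name. */
final class DatasetETags {
    private final ConcurrentHashMap<String, AtomicLong> versions = new ConcurrentHashMap<>();

    private AtomicLong counter(String dataset) {
        // Lazily create a counter starting at 0 the first time a dataset is seen.
        return versions.computeIfAbsent(dataset, k -> new AtomicLong());
    }

    /** Current ETag for the dataset, including the surrounding double quotes. */
    String currentETag(String dataset) {
        return "\"" + dataset + "-" + counter(dataset).get() + "\"";
    }

    /** Assumed to be called after any mutating request on the dataset completes. */
    void recordUpdate(String dataset) {
        counter(dataset).incrementAndGet();
    }

    /**
     * True when the client's If-None-Match value equals the current ETag,
     * meaning a 304 Not Modified could be returned instead of re-running the query.
     * Simplification: handles a single ETag value only, not "*" or a list.
     */
    boolean notModified(String dataset, String ifNoneMatch) {
        return ifNoneMatch != null && ifNoneMatch.equals(currentETag(dataset));
    }
}
```

Two caveats on the sketch: the counters reset on restart, so a real version would probably want to mix a per-startup token into the ETag, and, as noted under option 1, changes made to the indexes outside Fuseki would never reach recordUpdate().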
