Hi Andy & Jena development community,
(Answers inline - apologies if I repeat myself)
FYI - Our aim is to enable end-users to make SPARQL queries whilst
respecting visibility restrictions.
I.e. users (indirectly) add sets of related triples to a dataset and
they can choose who has visibility (beyond themselves) over these,
either: Nobody, Everyone or a chosen set (which can be updated). Note
that this restriction is not by a specific subject or predicate.
(Although the sets of triples do have relationships - not all of them
are known in advance.)
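To make the visibility model above concrete, here is a minimal sketch of how a per-graph ACL might be resolved into the set of graphs a user can see. It is not our implementation: the names `Visibility`, `GraphAcl` and `visible_graphs`, and the `urn:example:` IRIs, are hypothetical and only illustrate the Nobody / Everyone / chosen-set options.

```python
from dataclasses import dataclass, field
from enum import Enum


class Visibility(Enum):
    NOBODY = "nobody"      # only the owner can see the graph
    EVERYONE = "everyone"  # any user can see the graph
    CHOSEN = "chosen"      # the owner plus an explicit, updatable set of users


@dataclass
class GraphAcl:
    owner: str
    visibility: Visibility = Visibility.NOBODY
    allowed: set = field(default_factory=set)  # only consulted for CHOSEN

    def visible_to(self, user: str) -> bool:
        # Owners always see their own triples.
        if user == self.owner or self.visibility is Visibility.EVERYONE:
            return True
        return self.visibility is Visibility.CHOSEN and user in self.allowed


def visible_graphs(acls: dict, user: str) -> list:
    """Return the sorted graph IRIs that `user` may query."""
    return sorted(g for g, acl in acls.items() if acl.visible_to(user))


# Hypothetical example data: one ACL per named graph.
acls = {
    "urn:example:g1": GraphAcl("alice"),
    "urn:example:g2": GraphAcl("alice", Visibility.EVERYONE),
    "urn:example:g3": GraphAcl("bob", Visibility.CHOSEN, {"carol"}),
}
```

The resulting list is what would be handed to Fuseki alongside the query; the restriction is keyed on the graph, not on any subject or predicate.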
On Thu, 28 Jul 2022 at 10:43, Andy Seaborne <a...@apache.org> wrote:
JENA-2339
PR#1441
https://github.com/vtermanis/jena/blob/dynamic-graph-restriction-extension/MOVE_ME_DynamicACL_notes.md
tl;dr:
It is a different role for Fuseki.
Fuseki executes the security but the setup and control come from a
trusted external server on the request execution path.
It assumes certain deployment environments to be safe.
FYI - In our case this means that we have a "make SPARQL query" API
call. When received, the applicable user (our domain) is known and, in
the proposed PR, we can prepend the set of allowed graphs to the query
(which have been looked up prior to query execution, externally). The
end user has NO direct access to Fuseki itself.
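As a rough illustration of the flow just described, the sketch below shows a middleware step that prepends the allowed graphs to the query text as a leading comment, and the matching read step on the receiving side. The `#acl-graphs:` header format and both function names are assumptions made for this example; the actual format used in the PR may well differ.

```python
HEADER_PREFIX = "#acl-graphs:"  # hypothetical header marker, not the PR's format


def prepend_allowed_graphs(query: str, graphs: list) -> str:
    """Middleware side: put the allowed graph IRIs on a first-line comment."""
    for g in graphs:
        # Reject anything that could break out of the single header line.
        if any(c in g for c in "<> \t\r\n"):
            raise ValueError(f"unsafe graph IRI: {g!r}")
    header = (HEADER_PREFIX + " " + " ".join(f"<{g}>" for g in graphs)).rstrip()
    return header + "\n" + query


def read_allowed_graphs(request_text: str):
    """Server side: split the first-line header from the user's query."""
    first, _, rest = request_text.partition("\n")
    if not first.startswith(HEADER_PREFIX):
        raise ValueError("missing ACL header")
    iris = [token[1:-1] for token in first[len(HEADER_PREFIX):].split()]
    return iris, rest


user_query = "SELECT * { ?s ?p ?o }"
request = prepend_allowed_graphs(user_query, ["urn:example:g1", "urn:example:g2"])
graphs, query = read_allowed_graphs(request)
```

Because the middleware always writes the first line itself, a header-like comment inside the end user's query text lands on a later line and is treated as an ordinary SPARQL comment.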
My feeling is that we should make Fuseki configurable enough so that a
downstream 3rd party can add their security solution that is suitable
for their environment. But we should not incorporate a particular
security solution that relies on the deployment environment.
----
I've asked for more information about the claim on a performance
motivator and some other background information.
The usage patterns are not yet clear. The data is described as "a one
graph per handful of subjects and their properties" and "100s of
graphs". What the queries are is unstated.
Right now, each graph has in the range of 300-500 triples (though the
amount depends on how much additional/domain-specific metadata
end-users choose to add) and deployed Fuseki datasets range in scale
from a few to ~6k graphs.
Since we'd like to allow end-users to run **any** queries they wish
(we enforce query timeouts), it's difficult to give concrete examples.
I can however say that TDB unionDefaultGraph mode is enabled (i.e.
most end-users won't choose to explicitly target a specific graph) and
that one of our representative "search" queries (which combines
GeoSPARQL + multiple explicit property matching across multiple
different subjects in a UNION + subsequent collection of mandatory &
optional fields) runs 20-40% faster with the proposed change than with
our current custom solution.
(Note that we have also tried query re-writing to insert FROM/FROM
NAMED clauses - and that is very slow in comparison, presumably due to
the higher-level filtering involved, unlike the quad filter herein.)
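For readers unfamiliar with the idea, here is a toy in-memory sketch of what filtering at the quad level means when combined with a union default graph. It is not how TDB implements it; quads are plain `(graph, subject, predicate, object)` tuples and both function names are made up for the example.

```python
def make_quad_filter(allowed_graphs):
    """Build a predicate that accepts only quads from allowed graphs."""
    allowed = frozenset(allowed_graphs)  # O(1) membership test per quad

    def accept(quad):
        graph = quad[0]
        return graph in allowed

    return accept


def union_default_graph(quads, accept):
    """Triples visible in the default graph: the union of all accepted graphs."""
    return {(s, p, o) for (g, s, p, o) in quads if accept((g, s, p, o))}


# Hypothetical data: the same triple appears in two graphs.
quads = [
    ("urn:example:g1", "s1", "p", "o1"),
    ("urn:example:g2", "s2", "p", "o2"),
    ("urn:example:g3", "s1", "p", "o1"),
]
accept = make_quad_filter(["urn:example:g1", "urn:example:g3"])
visible = union_default_graph(quads, accept)
```

The point of the sketch is only that the check is a single set lookup per quad, applied before any pattern matching, rather than a dataset description the query engine has to evaluate.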
There is no characterisation of the queries being made. If we are
talking about overheads, the cases of a few big queries and many small
queries are different.
(pasted from JENA-2339 ticket) - using a "SELECT {} 1" query, and
adding a certain set of graphs makes the queries on my laptop take:
~600 graphs ~115ms
~1500 graphs ~162ms
~3k graphs ~240ms
~6k graphs ~400ms
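A quick back-of-envelope reading of those four numbers, treating the approximate graph counts as exact: the cost per additional graph looks roughly constant. The snippet below is just that arithmetic, not a benchmark, and four laptop measurements are obviously a thin basis.

```python
# (graph count, milliseconds); counts are the approximate figures quoted above
timings = [(600, 115), (1500, 162), (3000, 240), (6000, 400)]


def marginal_costs(points):
    """Milliseconds per additional graph between consecutive measurements."""
    return [(t2 - t1) / (n2 - n1)
            for (n1, t1), (n2, t2) in zip(points, points[1:])]


def fixed_cost(points):
    """Extrapolate the first-to-last slope back to zero graphs."""
    (n1, t1), (n2, t2) = points[0], points[-1]
    slope = (t2 - t1) / (n2 - n1)
    return t1 - slope * n1


print([round(c, 3) for c in marginal_costs(timings)])  # [0.052, 0.052, 0.053]
print(round(fixed_cost(timings)))                      # 83
```

So, on these figures, each extra graph in the ACL adds about 0.05 ms, on top of a fixed cost somewhere around 80 ms.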
The scale looks small (less than a million triples -
approximating as 100 graphs * 1000 triples). That makes the point about
access to TDB hooks a bit redundant.
The dataset I've tested this with has ~1.8M triples. That's not to say
this is the scale we're hoping to satisfy - that's just what I
tested with first. By redundant, do you mean an alternative approach
should be used for this scale?
There are distinguished users. A request from one of these users
causes the set of visible graphs to be read from a comment at the start
of the query text in the request.
The use of large numbers of small named graphs to manage security
settings looks to me like triple-level security. I have already
mentioned work "FMod_ABAC": (£job related) a while back (2/Jan/2022). It
is triple level attribute-based security.
It could well be that I'm seeing the wrong solution for the feature
we're trying to support (that's the other reason for reaching out to
the community). The reason (rightly or wrongly) to model this as a set
of graphs is: the triples in each restricted set are related, but
span multiple subjects and could also relate to subjects in
other sets (as well as externally).
Hence I couldn't see how e.g. Jena Permissions could be applied here:
When you're provided with a single triple to check - you would have to
understand what type of subject it is and how it relates to the "top
level" subject to which the ACL applies. Bundling everything into a
graph seemed like a viable option.
Concern 1:
This bypasses Fuseki-provided security and puts the control function
outside the Fuseki server in a separate server that is not part of Jena.
It will only be secure if deployed in a constrained network environment.
This is not secure except when run in a certain way and, personally, I
don't want to have to deal with a CVE because of that. CVE handling is
time consuming.
I don't see why it is using jena-access (the named graph security
feature) except for the filtering on TDB. It is creating a dynamic
dataset for the query.
You're right - it's only as secure as the middleware/proxy/whatever in
front of it which supplies the ACL. (This was never intended to be
used/exposed to end-users directly.)
The purpose of extending jena-access (instead of immediately writing
it as a separate module) was to illustrate with minimal code changes
(+ extension of existing tests) what it could look like, for
discussion. (The quad filtering / performance aspect would be the
same, regardless of location, I presume.)
Concern 2: How does update fit into the picture? (GSP is not supported).
I thought that, since GSP operations target a single graph, there is
no need to extend support to it since it's already possible to
restrict visibility (with the graph query parameter). Am I missing
something?
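To spell out what I mean: a GSP request names exactly one graph via the `graph` query parameter, so the middleware can check that single IRI against the caller's visible graphs before forwarding. A minimal sketch follows; the endpoint URL and the function name `gsp_url` are assumptions for the example.

```python
from urllib.parse import urlencode


def gsp_url(endpoint: str, graph_iri: str, allowed_graphs) -> str:
    """Build a Graph Store Protocol URL, refusing graphs the caller cannot see."""
    if graph_iri not in allowed_graphs:
        raise PermissionError(f"graph not visible to caller: {graph_iri}")
    # urlencode percent-encodes the IRI so it is safe as a query parameter.
    return endpoint + "?" + urlencode({"graph": graph_iri})


# Hypothetical endpoint and graph names.
endpoint = "http://localhost:3030/ds/data"
allowed = {"urn:example:g1", "urn:example:g2"}
url = gsp_url(endpoint, "urn:example:g1", allowed)
```

Since the check happens on one IRI per request, no dynamic set of graphs needs to reach Fuseki for GSP.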
Concern 3: It looks like a specific solution for a specific scenario.
Will it get uptake by the wide Jena user community?
It's definitely specific. My thinking was that, if a subset of this
were deemed useful, then it'd be better to exist as part of the core
offering as opposed to us just bolting it on ourselves (at my job).
But, if that's not the case - fair enough.
Concern 4: Is there long-term support and maintenance for the feature?
(e.g. 5y+)
How do we respond to a users@ message about it? Is it experimental code or
has it been used for real? Is the feature set stable?
My understanding is that jena-access is classed as stable (we're using
it for something else already in production) and thus, since this
change merely produces a SecurityContext with a larger set of graphs,
it would theoretically be no less stable.
Opinion: it is not unreasonable to provide support for this kind of
customization of Fuseki.
An extension can then provide whatever security is needed for the
situation and it is the Fuseki user/operator making the decisions about
what is acceptable security and what isn't.
Fuseki has ways to add custom processors and this seems the way to
provide an alternative way to make queries.
Putting it in the distribution codebase is a big step for the project.
At the very least, it needs to be mature and likely to be used.
We wouldn't be reaching out if we weren't likely to want to use such a
feature. All these concerns/questions/suggestions are exactly what we
were hoping for. If I can provide any more context/tests/samples, let
me know.
(I completely get the concerns about diluting a known security feature
and have no issue with something like this being a separate
component.)
Background: Currently jena-access is in Fuseki main. It is not optional
because it predates Fuseki modules.
Andy
--
Vilnis Termanis
Technical Specialist
e | vilnis.terma...@iotics.com
www.iotics.com