On 04/09/12 07:29, Claude Warren wrote:
+1
I have to agree that this is a nice simplification of the jena complexity.
It would be nice to know why they were created in the first place, just to
ensure that those issues are accounted for. However, I don't see any
reason not to do this and several reasons to proceed.
Claude
Good question.
What I want to do is simplify and reduce the Graph layer.
Graph/Triple/Node is a key abstraction for extension both downwards
(storage, inference) and upwards (Model, query, client).
I can give my personal, looking-back perspective, remembering that I
wasn't there right at the beginning of the Model API.
And we learn - sometimes things looked to be the right thing at the time
but don't always turn out as expected either because a design didn't
work out (internal) or the world has gone in a different direction
(external).
These features here aren't used or are used so little that they create
complexity for an extension and for maintenance with very little benefit.
BulkUpdateHandler falls into the internal category. Batching changes
was obviously important right from the very first database backed
storage layer (before even RDB) because making changes in a batch can be
cheaper than making them one at a time (e.g. a JDBC commit around a batch
is much cheaper than a commit for every triple).
BulkUpdateHandler does not meet the needs for that:
1/ The batch size is driven from the client but the correct size is a
matter for the storage if batching matters at all.
2/ It complicates each application to manage the batching when it could
be done once in the graph implementation if it matters. For a library
function, like a parser, to know the right batching is hard and probably
messes up its API.
Streaming + storage-side internal batching is better.
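To make "streaming + storage-side internal batching" concrete, here is a minimal sketch. It is not Jena code: SketchTriple, BatchingGraphSketch, the commit counter and the batch size are all hypothetical names made up for illustration. The caller just streams add() calls; the storage decides when a batch is written.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a triple; not Jena's Triple class.
final class SketchTriple {
    final String s, p, o;
    SketchTriple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
}

// Hypothetical graph whose storage layer, not the caller, picks the batch size.
final class BatchingGraphSketch {
    private final int batchSize;                              // chosen by the storage layer
    private final List<SketchTriple> pending = new ArrayList<>();
    private final List<SketchTriple> stored = new ArrayList<>();
    private int commits = 0;                                  // counts simulated commits

    BatchingGraphSketch(int batchSize) { this.batchSize = batchSize; }

    // Callers stream triples one at a time and never see the batching.
    void add(SketchTriple t) {
        pending.add(t);
        if (pending.size() >= batchSize) flush();
    }

    // Write out whatever is buffered as one unit (think: one JDBC commit).
    void flush() {
        if (pending.isEmpty()) return;
        stored.addAll(pending);
        pending.clear();
        commits++;
    }

    int size() { return stored.size(); }
    int commits() { return commits; }
}
```

A parser feeding this graph only calls add(...) per triple and flush() once at the end of the run, so its API does not need to know anything about batch sizes.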
So keep the operations that have some practical use, for example, adding
Graph.removeAll, and don't put it off to one side. It can still be
overridden.
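As a rough illustration of "on Graph, but still overridable": the sketch below uses Java default methods, which is my assumption about how one might write it, not a statement about how Jena does or will do it. GraphSketch and MemGraphSketch are hypothetical, and triples are plain String arrays to keep it short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, trimmed-down graph interface; not Jena's Graph.
interface GraphSketch {
    void add(String s, String p, String o);
    void delete(String s, String p, String o);

    // Matching triples as {s, p, o} arrays; null acts as a wildcard.
    List<String[]> find(String s, String p, String o);

    // General-purpose fallback built from find and delete.
    default void remove(String s, String p, String o) {
        for (String[] t : find(s, p, o)) delete(t[0], t[1], t[2]);
    }

    default void removeAll() { remove(null, null, null); }
}

// In-memory implementation: keeps the default remove(s,p,o),
// overrides removeAll() because it can do better than a loop.
final class MemGraphSketch implements GraphSketch {
    private final List<String[]> triples = new ArrayList<>();

    public void add(String s, String p, String o) {
        triples.add(new String[] { s, p, o });
    }

    public void delete(String s, String p, String o) {
        triples.removeIf(t -> t[0].equals(s) && t[1].equals(p) && t[2].equals(o));
    }

    public List<String[]> find(String s, String p, String o) {
        List<String[]> out = new ArrayList<>();   // a copy, so callers may delete while iterating
        for (String[] t : triples) {
            if ((s == null || s.equals(t[0]))
                    && (p == null || p.equals(t[1]))
                    && (o == null || o.equals(t[2]))) {
                out.add(t);
            }
        }
        return out;
    }

    @Override
    public void removeAll() { triples.clear(); }

    int size() { return triples.size(); }
}
```

A storage layer with nothing clever to offer inherits the loops; one that can truncate a table or clear an index overrides just that method.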
Reification:
Semweb has moved on and reification is not important - quoting one
triple leaves the issue of grouping quoted triples together, and often
fact-units come in the form of more than one triple. Named graphs are
playing the role for quoted facts - named graphs postdate reification.
The number of uses of it outside "standard" is very low. "standard" can
be done in code over a store of triples; the other modes "minimal" and
"convenient" need some state to be kept.
http://jena.apache.org/documentation/notes/reification.html#reification-styles
(most of the rest of the documentation remains - the Model API is only
affected in that there is only one style).
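To show what "done in code over a store of triples" can look like, here is a small sketch. It is not the Jena implementation: ReificationSketch, T, has, only and reifiedTriple are hypothetical names, and I am assuming a node counts as a reification only when rdf:type rdf:Statement is present and rdf:subject, rdf:predicate and rdf:object each have exactly one value. Nothing is remembered between calls; the answer is recomputed from the triples.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

final class ReificationSketch {
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Hypothetical triple of plain strings; not Jena's Triple class.
    static final class T {
        final String s, p, o;
        T(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    }

    // True if the store contains exactly this triple.
    static boolean has(List<T> store, String s, String p, String o) {
        for (T t : store) {
            if (t.s.equals(s) && t.p.equals(p) && t.o.equals(o)) return true;
        }
        return false;
    }

    // The single object of (node, property, ?o); empty if there are zero or several.
    static Optional<String> only(List<T> store, String node, String property) {
        List<String> found = new ArrayList<>();
        for (T t : store) {
            if (t.s.equals(node) && t.p.equals(property)) found.add(t.o);
        }
        if (found.size() != 1) return Optional.empty();
        return Optional.of(found.get(0));
    }

    // Recompute the triple that 'node' reifies; partial reifications give empty.
    static Optional<T> reifiedTriple(List<T> store, String node) {
        if (!has(store, node, RDF + "type", RDF + "Statement")) return Optional.empty();
        Optional<String> s = only(store, node, RDF + "subject");
        Optional<String> p = only(store, node, RDF + "predicate");
        Optional<String> o = only(store, node, RDF + "object");
        if (!s.isPresent() || !p.isPresent() || !o.isPresent()) return Optional.empty();
        return Optional.of(new T(s.get(), p.get(), o.get()));
    }
}
```

Because it only reads triples, the same code works over any storage layer, which is the point: the store needs no extra reification state.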
Keeping the state is an implementation cost and complexity especially
for persistent storage layers. Quite a lot of effort for the RDB layer
went into reification.
So maintain the interface at the Model level - make Graph simpler.
graph.QueryHandler (qQH):
Once upon a time there was RDQL and an RDQL query is, in SPARQL terms,
a basic graph pattern, a filter and a projection and nothing else. qQH
does that. SPARQL is a bit more complicated. qQH isn't the right
building block for SPARQL - its execution API doesn't extend well into
a larger framework so we have ended up with some duplication.
So remove it. It all goes to making graph simpler - and Graph is a key
abstraction for extension.
Andy
On Mon, Sep 3, 2012 at 6:33 PM, Andy Seaborne wrote:
As part of wanting to tidy up and reduce the "core" of Jena, I'd like to
propose we
Remove BulkUpdateHandler interface
Migrate its few useful operations to Graph.
Start to provide reification with "standard" only.
graph.QueryHandler only used to support reification.
== BulkUpdateHandler
The two implementations I know of are
SimpleBulkUpdateHandler
UpdateHandlerSDB
A few of its operations are useful but most turn into nothing but loops
to call add(Triple)/delete(Triple).
Event handling details each operation kind but, as far as I can see, this
becomes individual calls to "addedStatement"/"removedStatement" at
the Model level, i.e. the difference between adding by array or list or
iterator gets lost.
The useful operations are:
add(Graph)
delete(Graph)
removeAll()
remove(s,p,o)
and the slightly bizarre:
add(Graph, withReifications)
delete(Graph, withReifications)
(see below about reification)
and the less useful (because they don't relate to the way the storage
might properly batch changes - the provider shouldn't decide the batch
boundaries) which turn into add(Triple)/delete(Triple)
add(Triple [])
add( List<Triple>)
add( Iterator<Triple>)
delete(Triple [])
delete( List<Triple>)
delete( Iterator<Triple>)
The only calls to these "add" operations are from ARP, which batches its
changes into units of 1000, but not a whole parser run. As
SimpleBulkUpdateHandler turns these into single calls, nothing is gained.
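For illustration, this is roughly what "turns into add(Triple)" amounts to. It is a sketch, not the SimpleBulkUpdateHandler source: LoopingBulkSketch, Tr and the event list are hypothetical stand-ins. Whatever container the caller used, the graph and its listeners only ever see single-triple adds.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class LoopingBulkSketch {
    // Hypothetical triple of plain strings; not Jena's Triple class.
    static final class Tr {
        final String s, p, o;
        Tr(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    }

    final List<Tr> triples = new ArrayList<>();
    final List<String> events = new ArrayList<>();   // one entry per notification sent

    // The only operation that actually touches the store.
    void add(Tr t) {
        triples.add(t);
        events.add("addedStatement");
    }

    // The three "bulk" forms are plain loops over add(Tr).
    void add(Tr[] ts) { for (Tr t : ts) add(t); }
    void add(List<Tr> ts) { for (Tr t : ts) add(t); }
    void add(Iterator<Tr> it) { while (it.hasNext()) add(it.next()); }
}
```

After adding the same two triples by array, by list and by iterator, the event list holds six identical "addedStatement" entries, so the batch boundaries chosen by the caller have had no effect on storage.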
My proposal is that the useful operations are moved to Graph, and the code
for the withReifications forms migrates to the only callers, in ModelCom.
UpdateHandlerSDB:
This only uses the UpdateHandler interface to wrap the calls in
start/finish bulk update to implicitly increase the scope of bulk updates.
But it isn't
== Reification
The intent is to only support the default standard eventually.
Standard can be provided by code, with no retained state (partial
reifications). TDB and SDB do not support anything except "standard".
This leads to ....
(graph.)QueryHandler:
Its main use is with reification. I think we can remove it when
reification is replaced by a straight code implementation.
Andy
See also JENA-189