On 04/09/12 07:29, Claude Warren wrote:
+1

I have to agree that this is a nice simplification of the jena complexity.
  It would be nice to know why they were created in the first place, just to
ensure that those issues are accounted for.  However, I don't see any
reason not to do this and several reasons to proceed.

Claude

Good question.

What I want to do is simplify and reduce the Graph layer. Graph/Triple/Node is a key abstraction for extension both downwards (storage, inference) and upwards (Model, query, client).

I can give my personal, looking-back perspective, remembering that I wasn't there right at the beginning of the Model API.

And we learn - sometimes things looked to be the right thing at the time but don't always turn out as expected either because a design didn't work out (internal) or the world has gone in a different direction (external).

These features here aren't used or are used so little that they create complexity for an extension and for maintenance with very little benefit.


BulkUpdateHandler falls into the internal category. Batching changes was obviously important right from the very first database backed storage layer (before even RDB) because doing changes in a batch can be cheaper than doing them one at a time (e.g. a JDBC commit around a batch is much cheaper than a commit for every triple).

BulkUpdateHandler does not meet the needs for that:

1/ The batch size is driven from the client but the correct size is a matter for the storage if batching matters at all.

2/ It complicates each application to manage the batching when it could be done once in the graph implementation if it matters. For a library function, like a parser, to know the right batching is hard and probably messes up its API.

Streaming + storage-side internal batching is better.
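
As a rough illustration of what "streaming + storage-side internal batching" could look like, here is a minimal sketch. Everything in it (SketchTriple, BatchingStore, the batch size of 1000) is a hypothetical stand-in, not Jena code: the client just streams add() calls and the storage decides when to pay for a commit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a triple; this is not Jena's Triple class.
class SketchTriple {
    final String s, p, o;
    SketchTriple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
}

// Hypothetical storage layer that chooses its own batch size.
class BatchingStore {
    private final int batchSize;  // decided by the storage, not by the client
    private final List<SketchTriple> pending = new ArrayList<>();
    private final List<SketchTriple> committed = new ArrayList<>();
    private int commitCount = 0;

    BatchingStore(int batchSize) { this.batchSize = batchSize; }

    // The client streams triples one at a time and never sees the batching.
    void add(SketchTriple t) {
        pending.add(t);
        if (pending.size() >= batchSize) flush();
    }

    // Stands in for one expensive commit (e.g. a JDBC commit around a batch).
    void flush() {
        if (pending.isEmpty()) return;
        committed.addAll(pending);
        pending.clear();
        commitCount++;
    }

    int commitCount() { return commitCount; }
    int size() { return committed.size(); }
}

class BatchingDemo {
    public static void main(String[] args) {
        BatchingStore store = new BatchingStore(1000);
        for (int i = 0; i < 2500; i++) {
            store.add(new SketchTriple("s" + i, "p", "o"));
        }
        store.flush();  // end of stream: push the final partial batch
        System.out.println(store.size() + " triples in " + store.commitCount() + " commits");
    }
}
```

With 2500 streamed triples and a storage-chosen batch of 1000 this prints "2500 triples in 3 commits"; a parser feeding the store needs no batching logic of its own.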

So keep the operations that have some practical use, for example, adding Graph.removeAll, and don't put it off to one side. It can still be overridden.


Reification:

Semweb has moved on and reification is not important - quoting one triple leaves the issue of grouping quoted triples together, and often fact-units come in the form of more than one triple. Named graphs are playing the role for quoted facts - named graphs postdate reification.

The number of uses of it outside "standard" is very low. "standard" can be done in code over a store of triples; the other modes "minimal" and "convenient" need some state to be kept.

http://jena.apache.org/documentation/notes/reification.html#reification-styles

(most of the rest of the documentation remains - the Model API is only affected in that there is only one style).
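
To illustrate how "standard" can be done in code over a store of triples with no retained state, here is a minimal sketch. The classes RTriple and StandardReificationSketch are hypothetical and are not Jena's API, and the sketch assumes a reification counts as complete only when the rdf:type rdf:Statement triple and exactly one each of rdf:subject, rdf:predicate and rdf:object are present for the node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for a triple; this is not Jena's Triple class.
class RTriple {
    final String s, p, o;
    RTriple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
}

// Hypothetical helper: answers reification questions by looking at the triples only.
class StandardReificationSketch {
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    static boolean contains(List<RTriple> graph, String s, String p, String o) {
        for (RTriple t : graph) {
            if (t.s.equals(s) && t.p.equals(p) && t.o.equals(o)) return true;
        }
        return false;
    }

    // The single value of (node, property, ?o); empty if missing or if two values differ.
    static Optional<String> single(List<RTriple> graph, String node, String property) {
        String found = null;
        for (RTriple t : graph) {
            if (t.s.equals(node) && t.p.equals(property)) {
                if (found != null && !found.equals(t.o)) return Optional.empty();
                found = t.o;
            }
        }
        return Optional.ofNullable(found);
    }

    // The triple reified by 'node', computed on demand; partial reifications give empty.
    static Optional<RTriple> reifiedTriple(List<RTriple> graph, String node) {
        if (!contains(graph, node, RDF + "type", RDF + "Statement")) return Optional.empty();
        Optional<String> s = single(graph, node, RDF + "subject");
        Optional<String> p = single(graph, node, RDF + "predicate");
        Optional<String> o = single(graph, node, RDF + "object");
        if (!s.isPresent() || !p.isPresent() || !o.isPresent()) return Optional.empty();
        return Optional.of(new RTriple(s.get(), p.get(), o.get()));
    }
}

class ReificationDemo {
    public static void main(String[] args) {
        String rdf = StandardReificationSketch.RDF;
        List<RTriple> graph = new ArrayList<>();
        graph.add(new RTriple("_:r", rdf + "type", rdf + "Statement"));
        graph.add(new RTriple("_:r", rdf + "subject", "ex:a"));
        graph.add(new RTriple("_:r", rdf + "predicate", "ex:b"));
        graph.add(new RTriple("_:r", rdf + "object", "ex:c"));
        Optional<RTriple> t = StandardReificationSketch.reifiedTriple(graph, "_:r");
        System.out.println(t.isPresent() ? t.get().s + " " + t.get().p + " " + t.get().o : "not reified");
    }
}
```

Nothing beyond the triples themselves is stored, which is the contrast with the other two styles.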

Keeping the state is an implementation cost and complexity especially for persistent storage layers. Quite a lot of effort for the RDB layer went into reification.

So maintain the interface at the Model level - make Graph simpler.


graph.QueryHandler (qQH):

Once upon a time there was RDQL, and an RDQL query is, in SPARQL terms, a basic graph pattern, a filter and a projection and nothing else. qQH does that. SPARQL is a bit more complicated. qQH isn't the right building block for SPARQL - its execution API doesn't extend well into a larger framework so we have ended up with some duplication.

So remove it. It all goes to making graph simpler - and Graph is a key abstraction for extension.

        Andy


On Mon, Sep 3, 2012 at 6:33 PM, Andy Seaborne <[email protected]> wrote:

As part of wanting to tidy up and reduce the "core" of Jena, I'd like to
propose we

   Remove BulkUpdateHandler interface
     Migrate its few useful operations to Graph.

   Start to provide reification with "standard" only.
     graph.QueryHandler only used to support reification.


== BulkUpdateHandler

The two implementations I know of are

  SimpleBulkUpdateHandler
  UpdateHandlerSDB

A few of its operations are useful but most turn into nothing but loops
to call add(Triple)/delete(Triple).

Event handling details each operation kind but, as far as I can see, this
becomes individual calls to an "addedStatement"/"removedStatement" at
the Model level i.e. the difference between adding by array or list or
iterator gets lost.

The useful operations are:
   add(Graph)
   delete(Graph)
   removeAll()
   remove(s,p,o)

and the slightly bizarre:

   add(Graph, withReifications)
   delete(Graph, withReifications)

(see below about reification)

and the less useful (because they don't relate to the way the storage
might properly batch changes - the provider shouldn't decide the batch
boundaries) which turn into add(Triple)/delete(Triple)

   add(Triple [])
   add( List<Triple>)
   add( Iterator<Triple>)
   delete(Triple [])
   delete( List<Triple>)
   delete( Iterator<Triple>)

The only calls to these "add" operations are from ARP which batches its
changes into units of 1000, but not a whole parser run. As the
SimpleBulkUpdateHandler turns these into single calls, nothing is gained.
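
For illustration, the "nothing but loops" shape described above can be sketched as follows. This is not the actual SimpleBulkUpdateHandler source; PlainTriple, TinyGraph, LoopingBulkOps and CountingGraph are hypothetical names used only to show that array, list and iterator forms all collapse into single add/delete calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a triple; this is not Jena's Triple class.
class PlainTriple {
    final String s, p, o;
    PlainTriple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
}

// Hypothetical minimal graph interface; this is not Jena's Graph.
interface TinyGraph {
    void add(PlainTriple t);
    void delete(PlainTriple t);
}

// Every bulk form is just a loop over the single-triple operation.
class LoopingBulkOps {
    private final TinyGraph graph;
    LoopingBulkOps(TinyGraph graph) { this.graph = graph; }

    void add(PlainTriple[] triples) { for (PlainTriple t : triples) graph.add(t); }
    void add(List<PlainTriple> triples) { for (PlainTriple t : triples) graph.add(t); }
    void add(Iterator<PlainTriple> triples) { while (triples.hasNext()) graph.add(triples.next()); }

    void delete(PlainTriple[] triples) { for (PlainTriple t : triples) graph.delete(t); }
    void delete(List<PlainTriple> triples) { for (PlainTriple t : triples) graph.delete(t); }
    void delete(Iterator<PlainTriple> triples) { while (triples.hasNext()) graph.delete(triples.next()); }
}

// In-memory graph that counts how often the single-triple add is called.
class CountingGraph implements TinyGraph {
    final List<PlainTriple> triples = new ArrayList<>();
    int addCalls = 0;

    public void add(PlainTriple t) { addCalls++; triples.add(t); }
    // Removal is by object identity here because PlainTriple does not override equals.
    public void delete(PlainTriple t) { triples.remove(t); }
}

class BulkLoopDemo {
    public static void main(String[] args) {
        CountingGraph g = new CountingGraph();
        LoopingBulkOps bulk = new LoopingBulkOps(g);
        PlainTriple a = new PlainTriple("a", "p", "o");
        PlainTriple b = new PlainTriple("b", "p", "o");
        bulk.add(new PlainTriple[] { a, b });
        bulk.add(Arrays.asList(a, b).iterator());
        System.out.println(g.addCalls + " single add calls");
    }
}
```

Whatever batch boundary the caller picked is invisible to the graph, which only ever sees one triple at a time.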

My proposal is that the useful operations are moved to Graph, and the code for
the withReifications forms migrates to the only callers in ModelCom.

UpdateHandlerSDB:

This only uses the UpdateHandler interface to wrap the calls in
start/finish bulk update to implicitly increase the scope of bulk updates.
  But it isn't

== Reification

The intent is to only support the default standard eventually.

Standard can be provided by code, with no retained state (partial
reifications).  TDB and SDB do not support anything except "standard".

This leads to ....

(graph.)QueryHandler:
Its main use is with reification.  I think we can remove it when
reification is replaced by a straight code implementation.

         Andy

See also JENA-189




