Re: Streaming SPARQL Updates

Stephen Allen Fri, 01 Feb 2013 13:06:48 -0800

Replying to both Andy and Rob in this email, comments inline.

So this part about what the proper way to modify the update engine was
one of the parts I had the most difficulty with.  So feel free to dive
in and change it where you see it can be done better.

On Thu, Jan 31, 2013 at 7:48 AM, Rob Vesse <[email protected]> wrote:
> Awesome
>
> Can we have some javadoc on the new interfaces please?
>
> Couple of questions since I now need to at least stub these for our
> internal implementations.
>
> Primarily these revolve around how one opts out of dynamic updates since
> these won't be usable in our environment due to the peculiarities of our
> hybrid computing architecture.
>
> Is it safe/recommended to just throw an exception in methods like
> createStreaming() of UpdateEngineFactory if your implementation does not
> support streaming?

I didn't want to refactor too much, so the streaming support was
rather grafted on (and not that clear).  Basically there are now two
codepaths through Fuseki that depend on whether the GraphStore
implements the Transactional interface.  I think the code may
currently be considered kind of broken, as it assumes that if you
implement Transactional, then you can also supply a
UpdateEngineStreaming.  There is no fallback to a regular
UpdateEngine.

I think this is rather ugly, and we need a single codepath (based on a
streamable API).  See below in Andy's comments.

> Some possible slight refactors:
>
> UpdateEngineBase implements UpdateEngine and UpdateEngineStreaming - can I
> suggest that we add an UpdateEngineStreamingBase which adds the
> UpdateEngineStreaming interface so non-streaming implementations can still
> just extend UpdateEngineBase.  Either that implement UpdateEngineStreaming
> in UpdateEngineMain not in UpdateEngineBase.
>
> The getInsertDataSink() and getDeleteDataSink() on UpdateVisitor seem at
> odds with a visitor interface, would they not be better off in a separate
> interface?  I see only one usage of these so them being placed there seems
> somewhat arbitrary.

You're correct.  And, furthermore, it is inconsistent that queries and
insert/delete data operations are handled differently.  Ultimately
maybe the visitor interface isn't appropriate.  Basically I see two
options:

  1) Update engines need to create and return sinks that can accept
update operations (see UpdateSink)
  2) Update engines accept an Iterator of update operations
(Iterator<Update>, but with UpdateDataInsert and UpdateDataDelete
changed to return an iterator of quads instead of a list)

Option 2) is conceptually much cleaner from the update engine's point
of view, but to implement that in ARQ would require a queue and a
separate thread like PipedRDFStream / Iterator (this is driven by
JavaCC having to be in charge of parsing).

We currently have something like 1), but it is not very clean since
the old visitor interface is still there.

>
> Rob
>

On Thu, Jan 31, 2013 at 9:45 AM, Andy Seaborne <[email protected]> wrote:
> Great stuff and getting into 2.10 is good timing.
>
> I'm working through the changes so comments may come in a bit of a drip
> feed, and hence not necessarily in importance order.
>
>
> ** UpdateEngineFactory
> (default methods interfaces would be nice!)
>
> ++++
> It would be good to hear from from all storage layer implementer on this.
> ++++
>
> If we are making changes, then it would be better to rework this.  I find
> the two independent update engine concepts confusing and I think it does not
> help storage implementers.

I agree, it is not clean how it is.  We should solve this now so
implementer only have to make one change.

> When streaming, the ability to look at the request before deciding whether
> to accept it is lost.  No big deal - I don't think this is ever used.
> Looking at the request can be done by an indirection engine that decides at
> the point of execution which real engine to pass calls to.

Yes, drop that feature.

> So if we go for one UpdateEngine interface, UpdateEngineFactory becomes:
>
>   public boolean accept(GraphStore graphStore, Context context)
>   public UpdateEngine create(GraphStore graphStore, Context context)

I like it.

> Initial binding looks like a mistake because an update is multiple
> operations and historical at best. Remove this (via a deprecation cycle?).
> Use BIND or VALUES nowadays.

Agreed.  Initial binding should be removed.  I'd like to remove it
immediately w/o a deprecation cycle because it complicates the
implementation, and would require storage layer implementers to
continue to support it.  Also would require another breaking API
change when we eventually get rid of it.

It is a user facing change, but is it used much in the wild (I would
guess probably not)?  I would guess that it is probably used a lot
more on the query side (here we should deprecate instead of remove).

> Require an update engine impl to handle any request - it can always bump to
> the general engine.
>
> The core interface API is streaming and UpdateEngineBase provides stream ->
> request that can be overridden.
>
> (have now seen Rob's comment suggesting interfaces that can be instanceof'ed
> so a can just extend UpdateEngineBase as before).
>
>
> ** UsingList:
>
> When applied to a whole request, this is a protocol feature.  I think it's
> better to keep it as such.  It seems a small feature for API use thet could
> be handled via the context (it is the same as rewriting each UpdateModify
> with no set UsingList to have the global one).

Yes, I implemented this the wrong way by passing it all the way down
into the storage engine just so it could include it when it was
building an UpdateSink to return.  Instead a better way would be to
simply just wrap the UpdateSink that comes from the storage engine
with one that rewrites UpdateModify objects from the parser before
passing them on.

This should be an easy fix.

> The use case is forming a dataset for the WHERE part of an update from a
> large collection of graphs.  The storage engine may have opinions here (e.g.
> security, or only certain fixed shapes) so the selection process is, to my
> mind, impl specific.
>
> (but UpdateWithUsing - could do with the UsingList locally internally!)
>
>
> Alternative proposal:
>   have two datasets for update - the one to update and the one to query
>   GraphStore has a getDatasetForWHERE() operation.
>     We need to track down the use of GraphStore for WHERE
>     but only in the general purpose engine.
>     Normally it's "this" - can be reset.
>   It moves responsibility to GraphStore creation.
>   No UsingList in the UpdateExecutionFactory API is needed
>
> In ARQ, you create an appropriately structures GraphStore with the graphs of
> choice.  In Fuseki, it uses DynamicDatasets.dynamicDataset
> (c.f. UpdateEngineWorker.processUsing) for building the view dataset.
>
>
> ** DataSinks
>
> I didn't understand what's happening with the sinks, sorry.

The sinks for insert/delete data are a place to shove the quads that
get generated by the parser.  Instead of storing them in a List<Quad>,
they instead get passed all the way to the update engine which gave us
the sink.

> An update request can be more than one operations, including more than one
> INSERT DATA or DELETE DATA so multiple sinks of both kinds per request?
>
> Shouldn't the sinks be per operation?  In the writer for example:
>
>     @Override
>     public void visit(UpdateDataInsert update)
>     {
>         Iter.sendToSink(update.getQuads(),
>                         getInsertDataSink());
>     }
>
> maybe called for several UpdateDataInsert's

Yes, we call getInsertDataSink() per INSERT_DATA operation, and expect
the storage engine to give us a new one each time.

I am going to commit the change for simplifying the UsingList.  Do you
guys have an opinion on the two options I outlined in the response to
Rob?  I like 2) from an aesthetics point of view, but I worry a lot
about having to spawn a separate thread for each update query.  It
also is a big departure from the current API.  So I will look into how
to clean up 1).

-Stephen

Re: Streaming SPARQL Updates

Reply via email to