Tried to splice the two emails back together.  Comments inline:

On Wed, Jun 6, 2012 at 11:49 AM, Andy Seaborne <[email protected]> wrote:
>
> On Wed, Jun 6, 2012 at 11:01 AM, Stephen Allen <[email protected]> wrote:
>>
>> Hi Andy,
>>
>> I was curious about a statement you made on the user list yesterday:
>>
>> On Tue, Jun 5, 2012 at 2:43 PM, Andy Seaborne <[email protected]> wrote:
>>>
>>> Updates don't log.  Form-submitted updates are buffered - the entire
>>> string is available to be printed - but ones sent as
>>> "application/sparql-update" are stream read (e.g. a large
>>> INSERT DATA { .... })
>>>
>>
>> I was looking at the parsing code, and it's true that
>> "application/x-www-form-urlencoded" updates are buffered into a String
>> early in the process, although it appears to me that for
>> "application/sparql-update", the ARQParser and SPARQLParser11 also
>> have to buffer all the update data in UpdateRequest objects (which for
>> the DATA methods are an in-memory list of Quads).
>
> Yes and no.  The input stream is directly parsed to a syntax tree, so the
> string (the body) of the POST is not available to be printed.  There is
> "just" the one copy.
>
> It also means that if there is a parse error, the request is not printed
> in a normal setup.
>
> This is a balance - HTTP is generally about validate-execute, and it is
> also good to know the operation is valid before starting (not everything
> is transactional).
>
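To make the trade-off concrete, here is a toy sketch (plain Java, not the actual Fuseki request-handling code; the class and method names are invented for illustration) of the two paths being contrasted: buffering the whole POST body as a String so it stays available for logging and error reports, versus letting the parser consume the stream directly so there is only ever one copy of the data.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only - not Fuseki code.  Contrasts the two request
// paths discussed above: buffer-whole-body (loggable, printable on parse
// error) versus stream-parse (single copy, body gone if parsing fails).
public class UpdateBodyHandling {

    // Form-submitted path: read the entire body into a String first.
    // The request text survives for logging and error reporting.
    static String bufferWholeBody(InputStream body) throws IOException {
        return new String(body.readAllBytes(), StandardCharsets.UTF_8);
    }

    // "application/sparql-update" path: the "parser" consumes the stream
    // directly; here it just counts statements terminated by ';'.
    // There is only ever one copy of the data in flight.
    static int streamParse(InputStream body) throws IOException {
        int statements = 0;
        int c;
        while ((c = body.read()) != -1) {
            if (c == ';')
                statements++;
        }
        return statements;
    }

    public static void main(String[] args) throws IOException {
        byte[] req = "INSERT DATA { <s> <p> <o> } ;".getBytes(StandardCharsets.UTF_8);
        System.out.println(bufferWholeBody(new ByteArrayInputStream(req)));
        System.out.println(streamParse(new ByteArrayInputStream(req)));
    }
}
```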
Yes, we would need to validate the entire request if the underlying store
is not transactional and we want to attempt to be somewhat atomic
(although the WD says SHOULD on this matter).  We'd need to spill to disk
if the request is large.

> Maybe it should be less clever and do string-log-parse-execute.
>
>> I have been thinking about how to make this process streaming, but I
>> didn't know whether it made sense to try to modify the JavaCC parsers
>> to be streaming or to build a hybrid parser for just SPARQL Update.
>> This hybrid would handle INSERT DATA and DELETE DATA in a streaming
>> manner, and delegate regular updates to the existing parser.  Do you
>> have any thoughts or advice?
>>
>
> There is a tension between operations of just INSERT/DELETE DATA and
> combined, complex multi-part operations.  The latter leans towards
> complex parsing of whole sequences of actions before any operation.
>

Well, maybe we make it look like streaming, even if we're spilling out to
disk.  If we want to maintain atomicity (or an approximation of it), then
we can dump the request to disk as we are validating, and then replay it
for the actual update.  For transactional stores, we can dispense with
this and do validation and insertion/deletion at the same time.

> So I think a separate, streaming, bulk-focused parser for INSERT DATA
> and DELETE DATA would be the way to go (and an update processor, etc.).
>
> javacc sharing is not something I have ever managed to get working to
> separate the grammar from actions without distorting the entire thing
> to be dominated by that design goal.  I have tried to remove all code
> from the parser and just use events: the parser is streaming, and the
> superclass code builds the state.  It could be redone to pass in a
> builder rather than use the superclass.  SPARQL Update does include the
> whole of SPARQL Query pattern matching.
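To sketch the dump-to-disk-and-replay idea in code (a hypothetical illustration, not Fuseki's implementation - the class name, the toy line-based "validator", and the stand-in execute step are all invented): for a non-transactional store, the request is copied to a temp file while each operation is validated; only if the whole request validates is the file replayed against the store.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Sketch of validate-while-spilling for a non-transactional store:
// pass 1 validates each operation and spills it to disk; pass 2 replays
// the known-valid request.  A transactional store could skip the spill
// and apply operations as they validate.
public class SpillAndReplay {

    // Toy validator: every non-blank line must start with INSERT or DELETE.
    static boolean validOp(String line) {
        return line.isBlank() || line.startsWith("INSERT") || line.startsWith("DELETE");
    }

    static List<String> execute(Reader request) throws IOException {
        Path spill = Files.createTempFile("update", ".ru");
        try (BufferedReader in = new BufferedReader(request);
             BufferedWriter out = Files.newBufferedWriter(spill)) {
            String line;
            while ((line = in.readLine()) != null) {  // validate + spill, one pass
                if (!validOp(line))
                    throw new IOException("parse error: " + line);
                out.write(line);
                out.newLine();
            }
        }
        // Replay phase: the whole request is known-valid, apply it now.
        List<String> applied = new ArrayList<>();
        try (BufferedReader replay = Files.newBufferedReader(spill)) {
            String line;
            while ((line = replay.readLine()) != null)
                if (!line.isBlank())
                    applied.add(line);                // stand-in for the real update
        } finally {
            Files.delete(spill);
        }
        return applied;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(execute(new StringReader("INSERT a\nDELETE b\n")));
    }
}
```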
>
> So it's a bit of a mess from the multi-use point of view, but the spec
> is stable, so a copy is tolerable (if somewhat irritating from an
> aesthetic POV).
>
> Actually, this looks like the tip of a general need for a non-SPARQL
> (or SPARQL+ if you prefer) remote interface to Fuseki.  See also the
> users@ question about transactions across several Fuseki operations.

Transactions across multiple HTTP requests (even query/update) are one of
the features I'd like to support in jena-client.  Also remote query
cancellation.  And then probably lots of other stuff like query metrics,
server health, etc., but that sort of thing seems custom to Fuseki.  JMX
might work well here.

> So maybe there is a language lurking around here somewhere.  It would
> stream-execute.  More fine-grained than GSP, less than full SPARQL
> Update.
>
>   INSERT DATA, DELETE DATA
>   BEGIN/COMMIT/ABORT
>   CLEAR/DROP, LOAD
>   CREATE DATASET, DROP DATASET
>   UNMOUNT DATASET
>   MOUNT DATASET
>   BACKUP DATASET
>   ...

I see where you're going with the BEGIN/COMMIT/ABORT.  I think some way
of doing transactions and query identification (for tracking and
cancellation purposes) is something whose time has arrived, and we should
try out some implementations (with an eye towards future
standardization).  The DATASET commands might be a little too
implementation-specific for a standard, but they would be a cool feature
for Fuseki.

However, I'm not convinced that we should have the overlapping SPARQL
Update commands.  It seems that we probably want to support Update
properly, and then there wouldn't be much use for the overlap.  SPARQL
1.1 Update appears amenable to streaming if we put some brainpower on it.
I don't have much experience with JavaCC, but I am willing to learn.

>>
>> Another much simpler (although perhaps less satisfying) option would
>> be to replace the ArrayList in QuadAcc with a DataBag.
>>
>
> Partially - aren't we going to want to disallow other SPARQL operations
> that aren't wanted when streaming?
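A rough sketch of how such a store language could stream-execute (hypothetical code - the command names follow the list above, but everything else here is invented): each line is dispatched the moment it is read, so BEGIN/COMMIT can bracket work without buffering the whole script first.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stream-executing runner for a line-oriented store language.
// Commands are applied as they arrive; nothing is parsed ahead of need.
public class StoreScriptRunner {

    final List<String> log = new ArrayList<>();

    void dispatch(String line) {
        String cmd = line.strip();
        if (cmd.isEmpty())
            return;
        String head = cmd.split("\\s+")[0];
        if (head.equals("BEGIN") || head.equals("COMMIT") || head.equals("ABORT"))
            log.add(head.toLowerCase());              // transaction control
        else if (head.equals("INSERT") || head.equals("DELETE")
                || head.equals("LOAD") || head.equals("CLEAR")
                || head.equals("DROP"))
            log.add("op: " + cmd);                    // stand-in for real work
        else
            throw new IllegalArgumentException("unknown command: " + cmd);
    }

    void run(BufferedReader script) throws IOException {
        String line;
        while ((line = script.readLine()) != null)
            dispatch(line);                           // streaming, not batch
    }

    public static void main(String[] args) throws IOException {
        StoreScriptRunner r = new StoreScriptRunner();
        r.run(new BufferedReader(new StringReader(
                "BEGIN\nINSERT DATA { <s> <p> <o> }\nCOMMIT\n")));
        System.out.println(r.log);
    }
}
```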
> That's where I got to about a special store language.

I think we would want to allow all possible SPARQL Update operations
while supporting streaming.  As an example, the current app I'm
developing generates triples/quads to be streamed, while at the same time
interleaving update queries.  Separate requests mean you lose atomicity.

> A SPARQL Update can be several operations in a request.  It's not just
> what can be done but what needs to be made not possible.
>
> Is there an order-preserving DataBag impl?  This is also used to
> serialize updates as well.

Yes, the not-so-well-named DefaultDataBag [1].

> Feels like both small/general/nice-errors and
> large/stream/less-nice-errors are pulling in different directions a
> little.
>

Doing the DataBag implementation seems like the most expedient way
forward.  But as it does impose an unnecessary cost on transactional
stores, I think the improved Update parser will eventually be what we
want.

-Stephen

[1] http://svn.apache.org/repos/asf/jena/trunk/jena-arq/src/main/java/org/openjena/atlas/data/DefaultDataBag.java
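P.S. A minimal sketch of the order-preserving spill behaviour in the spirit of DefaultDataBag (illustrative code only, not the Jena/Atlas implementation): items accumulate in memory up to a threshold, then overflow to a temp file; iteration yields spilled items first and the in-memory tail last, so insertion order is preserved throughout.  (A real implementation would stream the spill file during iteration rather than reading it all back in, as is done here for brevity.)

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy order-preserving, spill-to-disk bag: memory buffer + append-only
// spill file.  Spills happen in insertion order, so a file-then-memory
// scan reproduces the original order.
public class SpillBag implements Iterable<String>, AutoCloseable {
    private final int threshold;
    private final List<String> memory = new ArrayList<>();
    private Path spill;                       // created lazily on first overflow

    SpillBag(int threshold) { this.threshold = threshold; }

    void add(String item) {
        memory.add(item);
        if (memory.size() >= threshold)
            flush();
    }

    private void flush() {
        try {
            if (spill == null)
                spill = Files.createTempFile("bag", ".txt");
            try (BufferedWriter w = Files.newBufferedWriter(spill, StandardOpenOption.APPEND)) {
                for (String s : memory) { w.write(s); w.newLine(); }
            }
            memory.clear();
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    @Override public Iterator<String> iterator() {
        List<String> all = new ArrayList<>();
        try {
            if (spill != null)
                all.addAll(Files.readAllLines(spill));  // spilled items first
        } catch (IOException e) { throw new UncheckedIOException(e); }
        all.addAll(memory);                             // then the in-memory tail
        return all.iterator();
    }

    @Override public void close() {
        try {
            if (spill != null) Files.delete(spill);
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    public static void main(String[] args) {
        SpillBag bag = new SpillBag(2);
        for (String q : new String[] { "q1", "q2", "q3", "q4", "q5" })
            bag.add(q);
        for (String q : bag)
            System.out.println(q);
        bag.close();
    }
}
```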
