On Wed, Aug 5, 2015 at 1:14 PM, Andy Seaborne <[email protected]> wrote:
> On 04/08/15 22:26, Stephen Allen wrote:
>
>> To my knowledge, the only argument for using GSP instead of just
>> query+update would be performance/scalability. That said, when I have
>> encountered those issues, I've attempted to fix the problem in
>> query+update instead (i.e., adding streaming support for update).
>> However, parsing large SPARQL INSERT DATA operations is still slower
>> than parsing NT (not to mention RDF/Thrift). There are potential
>> solutions for that (a SPARQL/Thrift implementation, even if it only did
>> INSERT/DELETE DATA as binary and left queries as string blobs), but
>> obviously that doesn't exist yet.
>>
> ...
>
>> One of the motivating features of jena-client was the ability to
>> perform large streaming updates (not just inserts/deletes) to a remote
>> store. This made up somewhat for the lack of remote transactions. But
>> maybe that isn't too strong an argument, when we could just go ahead
>> and implement remote transaction support (here is a proposal I haven't
>> worked on in over a year [3]).
>>
>
> GSP is very useful for managing data in a store when combined with a
> union of named graphs as the default graph. Units of the overall graph
> can be deleted (bnodes included) and replaced.
>
> It's also useful when scripting management of the data: with curl/wget
> you can manage a store in simple scripts. Being able to do the same
> thing in Java is helpful, so the user does not need two paradigms.
>
Sure, definitely makes sense. It does seem like we can provide both
mechanisms in a straightforward way.
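For instance, the Java side could mirror the HTTP verbs of GSP directly. A
rough sketch (the interface and method names here are just placeholders,
not a settled API):

    import org.apache.jena.rdf.model.Model;

    // Hypothetical GSP-style operations mirroring the HTTP verbs of the
    // Graph Store Protocol - names are placeholders, not a settled API.
    public interface GraphStoreOperations {
        Model fetch(String graphName);            // HTTP GET
        void put(String graphName, Model model);  // HTTP PUT (replace)
        void post(String graphName, Model model); // HTTP POST (merge)
        void delete(String graphName);            // HTTP DELETE
    }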
> Fuseki2 provides streaming updates for upload by GSP. RDFConnection has
> file upload features so the client side does not need to parse the file;
> it just passes an InputStream to the HTTP layer.
Makes sense. Jena-client doesn't do that: it transforms the data into an
update request instead, and obviously pays some penalties in doing so.
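For comparison, a GSP-style upload really is just streaming the bytes to
the endpoint. A minimal sketch with plain HttpURLConnection (the endpoint
URL, graph name, and file name are made up for illustration):

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class GspUpload {
        public static void main(String[] args) throws Exception {
            // POST a file to a graph store endpoint without parsing it
            // client-side - the bytes go straight onto the wire.
            URL url = new URL("http://localhost:3030/ds/data?graph=urn:example:g");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/n-triples");
            conn.setDoOutput(true);
            conn.setChunkedStreamingMode(0); // stream, don't buffer in memory
            try (InputStream in = Files.newInputStream(Paths.get("data.nt"));
                 OutputStream out = conn.getOutputStream()) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            }
            if (conn.getResponseCode() >= 300) {
                throw new RuntimeException("Upload failed: HTTP " + conn.getResponseCode());
            }
        }
    }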
> RDFConnection adds the natural REST ops on datasets.
>
>
> Authentication: we should use the HttpOp code - one reason is that it
> supports authentication for all HTTP verbs.
>
>
Agreed, jena-client uses the HttpOp code.
>> Jena-client's is more like JDBC in that the transaction operations are
>> exposed on the Connection object. If the user chooses not to use the
>> transaction mechanism then it will default to using "auto-commit".
>>
>
> Agreed, and in fact there is an issue here with autocommit, streaming
> and SELECT queries. The ResultSet is passed out of the execSelect
> operation but needs to be inside the transaction. Autocommit defeats
> that.
>
Yes, I tried to mitigate that with the AutoCommitQueryExecution class. It
wraps the QueryExecution used on a local dataset and enforces transaction
semantics between the exec*() and close() methods. Obviously it relies on
the user to call close() (or, better yet, use try-with-resources) on the
corresponding QueryStatement (they never see the QueryExecution object
directly).
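Roughly, the idea is this (a trimmed sketch, not the actual jena-client
code; only the two interesting methods are shown, and it uses the local
Dataset transaction API):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.ReadWrite;
    import org.apache.jena.query.ResultSet;

    // Sketch of the AutoCommitQueryExecution idea: open a read transaction
    // when execution starts and end it when the statement is closed, so the
    // ResultSet can stream while still living inside a transaction.
    class AutoCommitQueryExecutionSketch {
        private final Dataset dataset;
        private final QueryExecution inner;
        private boolean inTxn = false;

        AutoCommitQueryExecutionSketch(Dataset dataset, QueryExecution inner) {
            this.dataset = dataset;
            this.inner = inner;
        }

        ResultSet execSelect() {
            dataset.begin(ReadWrite.READ); // transaction opened here ...
            inTxn = true;
            return inner.execSelect();     // ... so the results can stream
        }

        void close() {
            inner.close();
            if (inTxn) {
                dataset.end();             // ... and closed with the statement
                inTxn = false;
            }
        }
    }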
>
> Which touches on the JDBC issue that drivers tend to execute and receive
> all the results before the client can start working on the answers
> (sometimes there are ways around this, to be used with care). The issue
> is badly behaved clients hogging resources on the server.
>
We could default to copying results into memory if we wanted, but provide
an override to disable that.
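Jena already has the machinery for the copy; e.g. (the method and flag
names here are mine, just for illustration):

    import org.apache.jena.query.ResultSet;
    import org.apache.jena.query.ResultSetFactory;

    class Results {
        // Default: detach the results from the transaction by copying them
        // into memory. The boolean stands in for a hypothetical config flag
        // that clients could use to opt back into true streaming.
        static ResultSet detach(ResultSet live, boolean copyResults) {
            if (copyResults) {
                // safe to use after the transaction/connection closes
                return ResultSetFactory.copyResults(live);
            }
            return live; // streaming: caller must stay inside the transaction
        }
    }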
>
> Some possibilities:
> 0/ Don't support autocommit. In the local case that is quite natural;
> less so for the remote case because HTTP is stateless.
>
> (I looked more at the remote case - e.g. the local connection
> implementation isolates results to get the same semantics as the remote
> one.)
>
> 1/ Autocommit cases receive the results completely. Some idioms don't
> work in autocommit mode.
>
> 2/ An operation to make sure the QueryExecution is inside a transaction
> and also closed.
>
> RDFConnection:
>
>     public default void querySelect(Query query, Consumer<QuerySolution> rowAction) {
>         Txn.executeRead(this, () -> {
>             try ( QueryExecution qExec = query(query) ) {
>                 qExec.execSelect().forEachRemaining(rowAction);
>             }
>         });
>     }
>
Although I think that using a Consumer like you do in 2/ is a great way of
doing things (it is exclusively how we allow queries in our app), perhaps
that functionality should be built as a utility on top of lower-level
operations that do let you shoot yourself in the foot if you like, and
then we strongly encourage users to do it the safe way.
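Something like this (the interface and method names are illustrative, not
a settled API):

    import java.util.function.Consumer;
    import org.apache.jena.query.Query;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QuerySolution;

    // Illustrative layering: the low-level method hands back a
    // QueryExecution the caller must manage; the default method is the
    // safe helper built on top of it.
    interface QueryConnection {
        QueryExecution query(Query query); // low-level: caller manages lifecycle

        default void querySelect(Query query, Consumer<QuerySolution> rowAction) {
            try (QueryExecution qExec = query(query)) {
                qExec.execSelect().forEachRemaining(rowAction);
            }
        }
    }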
>
> By the way - I added explicit transaction support and some example usage.
>
> Maybe we can use jena-client as a base to work from? If we feel we want to
>> add the separate GSP operations, then I think the extension point would be
>> to add a new GSP interface similar to Updater [5] (but lacking the generic
>> update query functionality).
>>
>
> I have no problem with jena-client as the starting point; I want to
> understand its design first.
>
> I'm not seeing what the separate interfaces and *Statement give the
> application - maybe I'm missing something here - it does seem to make it
> more complicated compared to just performing the operation. For
> *Statement, it's still limited in scope to the connection but can be
> passed out.
>
>
The reason for all the interfaces was to ease different implementations.
There are already the local dataset and remote cases, and as you mention
below, possibly some other non-HTTP case. Additionally, third parties
might want to provide their own implementations.

The main reason for the new QueryStatement and UpdateStatement classes was
that QueryExecution (and similarly UpdateProcessor) has methods that
seemed inappropriate:
* setInitialBinding(QuerySolution) - This is not SPARQL, and furthermore
  only works with local datasets
* getDataset() - Mostly doesn't make sense for remote datasets (because
  of blank nodes). Also, the remote case would have to fetch everything
  eagerly.
* getContext() - Only for local datasets
* getQuery() - This method could make sense to add to QueryStatement,
  though it implies parsing the query client-side

But these are relatively minor reasons.
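For what it's worth, here is a rough sketch of the trimmed-down surface
that remains (my guess at the shape, not a committed API):

    import org.apache.jena.query.ResultSet;
    import org.apache.jena.rdf.model.Model;

    // Rough sketch of a trimmed-down QueryStatement: the local-only
    // methods listed above are simply absent.
    interface QueryStatement extends AutoCloseable {
        ResultSet execSelect();
        Model execConstruct();
        Model execDescribe();
        boolean execAsk();

        @Override
        void close();
    }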
> Please remove the Sesame comments in javadoc and documentation. There's
> no need for javadoc and documentation to comment on another community's
> implementation choices, which can change. If you want to write up the
> reasons then have a blog item somewhere, which makes them more
> time-specific.
>
>
Yep, you are correct; I was trying to be overly helpful. Removed.
> We might want to consider a non-HTTP remote connection; at least design
> for the possibility. My motivation was initially more around working
> with other people's published data (i.e. a long way away, not in the
> same data centre).
>
>
Yeah, that would be a good idea. The HTTP protocol does impose some
annoying limitations, especially with transactions and duplex communication.
-Stephen