On Wed, Aug 5, 2015 at 1:14 PM, Andy Seaborne <[email protected]> wrote:
> On 04/08/15 22:26, Stephen Allen wrote:
>
>> To my knowledge, the only argument for using GSP instead of just
>> query+update would be performance/scalability. That said, when I have
>> encountered those issues, I've attempted to fix the problem in
>> query+update instead (i.e., adding streaming support for update).
>> However, parsing large SPARQL INSERT DATA operations is still slower
>> than parsing NT (not to mention RDF/Thrift). There are potential
>> solutions for that (a SPARQL/Thrift implementation, even if it only did
>> INSERT/DELETE DATA as binary and left queries as string blobs), but
>> obviously that doesn't exist yet.
>>
> ...
>
>> One of the motivating features of jena-client was the ability to
>> perform large streaming updates (not just inserts/deletes) to a remote
>> store. This made up somewhat for the lack of remote transactions. But
>> maybe that isn't too strong an argument, when we could just go ahead
>> and implement remote transaction support (here is a proposal I haven't
>> worked on in over a year [3]).
>>
>
> GSP is very useful for managing data in a store when combined with a
> union of named graphs as the default graph. Units of the overall graph
> can be deleted (bnodes included) and replaced.
>
> It's also useful when scripting management of the data: with curl/wget
> you can manage a store in simple scripts. Being able to do the same
> thing in Java is helpful, so the user does not need two paradigms.
>
Sure, definitely makes sense. It does seem like we can provide both
mechanisms in a straightforward way.
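For instance, the Java side could mirror the HTTP verbs of GSP directly. A
rough sketch (the interface and method names here are just placeholders,
not a settled API):

    import org.apache.jena.rdf.model.Model;

    // Hypothetical GSP-style operations mirroring the HTTP verbs of the
    // Graph Store Protocol - names are placeholders, not a settled API.
    public interface GraphStoreOperations {
        Model fetch(String graphName);            // HTTP GET
        void put(String graphName, Model model);  // HTTP PUT (replace)
        void post(String graphName, Model model); // HTTP POST (merge)
        void delete(String graphName);            // HTTP DELETE
    }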
> Fuseki2 provides streaming updates for upload by GSP. RDFConnection has
> file upload features so the client side does not need to parse the file;
> it just passes an InputStream to the HTTP layer.
Makes sense. Jena-client doesn't do that: it transforms the data into an
update request instead, and obviously pays some penalties in doing so.
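For comparison, a GSP-style upload really is just streaming the bytes to
the endpoint. A minimal sketch with plain HttpURLConnection (the endpoint
URL, graph name, and file name are made up for illustration):

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class GspUpload {
        public static void main(String[] args) throws Exception {
            // POST a file to a graph store endpoint without parsing it
            // client-side - the bytes go straight onto the wire.
            URL url = new URL("http://localhost:3030/ds/data?graph=urn:example:g");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/n-triples");
            conn.setDoOutput(true);
            conn.setChunkedStreamingMode(0); // stream, don't buffer in memory
            try (InputStream in = Files.newInputStream(Paths.get("data.nt"));
                 OutputStream out = conn.getOutputStream()) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            }
            if (conn.getResponseCode() >= 300) {
                throw new RuntimeException("Upload failed: HTTP " + conn.getResponseCode());
            }
        }
    }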
> RDFConnection adds the natural REST ops on datasets.
>
>
> Authentication: we should use the HttpOp code - one reason is that it
> supports authentication for all HTTP verbs.
>
>
Agreed, jena-client uses the HttpOp code.
>> Jena-client's is more like JDBC in that the transaction operations are
>> exposed on the Connection object. If the user chooses not to use the
>> transaction mechanism then it will default to using "auto-commit".
>>
>
> Agreed, and in fact there is an issue here with autocommit, streaming
> and SELECT queries. The ResultSet is passed out of the execSelect
> operation but needs to be inside the transaction. Autocommit defeats
> that.
>
Yes, I tried to mitigate that with the AutoCommitQueryExecution class. It
wraps the QueryExecution used on a local dataset and enforces transaction
semantics between the exec*() and close() methods. Obviously it relies on
the user to call close() (or, better yet, use try-with-resources) on the
corresponding QueryStatement (they never see the QueryExecution object
directly).
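Roughly, the idea is this (a trimmed sketch, not the actual jena-client
code; only the two interesting methods are shown, and it uses the local
Dataset transaction API):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.ReadWrite;
    import org.apache.jena.query.ResultSet;

    // Sketch of the AutoCommitQueryExecution idea: open a read transaction
    // when execution starts and end it when the statement is closed, so the
    // ResultSet can stream while still living inside a transaction.
    class AutoCommitQueryExecutionSketch {
        private final Dataset dataset;
        private final QueryExecution inner;
        private boolean inTxn = false;

        AutoCommitQueryExecutionSketch(Dataset dataset, QueryExecution inner) {
            this.dataset = dataset;
            this.inner = inner;
        }

        ResultSet execSelect() {
            dataset.begin(ReadWrite.READ); // transaction opened here ...
            inTxn = true;
            return inner.execSelect();     // ... so the results can stream
        }

        void close() {
            inner.close();
            if (inTxn) {
                dataset.end();             // ... and closed with the statement
                inTxn = false;
            }
        }
    }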
>
> Which touches on the JDBC issue that drivers tend to execute and receive
> all the results before the client can start working on the answers
> (sometimes there are ways around this, to be used with care). The issue
> is badly behaved clients hogging resources on the server.
>
We could default to copying results into memory if we wanted, but provide
an override to disable that.
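Jena already has the machinery for the copy; e.g. (the method and flag
names here are mine, just for illustration):

    import org.apache.jena.query.ResultSet;
    import org.apache.jena.query.ResultSetFactory;

    class Results {
        // Default: detach the results from the transaction by copying them
        // into memory. The boolean stands in for a hypothetical config flag
        // that clients could use to opt back into true streaming.
        static ResultSet detach(ResultSet live, boolean copyResults) {
            if (copyResults) {
                // safe to use after the transaction/connection closes
                return ResultSetFactory.copyResults(live);
            }
            return live; // streaming: caller must stay inside the transaction
        }
    }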
>
> Some possibilities:
> 0/ Don't support autocommit. In the local case that is quite natural;
> less so for the remote case because HTTP is stateless.
>
> (I looked more at the remote case - e.g. the local connection
> implementation isolates results to get the same semantics as the remote
> one.)
>
> 1/ Autocommit cases receive the results completely. Some idioms don't
> work in autocommit mode.
>
> 2/ An operation to make sure the QueryExecution is inside a transaction
> and also closed.
>
> RDFConnection:
>
>     public default void querySelect(Query query, Consumer<QuerySolution> rowAction) {
>         Txn.executeRead(this, () -> {
>             try ( QueryExecution qExec = query(query) ) {
>                 qExec.execSelect().forEachRemaining(rowAction);
>             }
>         });
>     }
>
Although I think that using a Consumer like you do in 2/ is a great way of
doing things (it is exclusively how we allow queries in our app), perhaps
that functionality should be built as a utility on top of lower-level
operations that do let you shoot yourself in the foot if you like, and
then we strongly encourage users to do it the safe way.
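Something like this (the interface and method names are illustrative, not
a settled API):

    import java.util.function.Consumer;
    import org.apache.jena.query.Query;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QuerySolution;

    // Illustrative layering: the low-level method hands back a
    // QueryExecution the caller must manage; the default method is the
    // safe helper built on top of it.
    interface QueryConnection {
        QueryExecution query(Query query); // low-level: caller manages lifecycle

        default void querySelect(Query query, Consumer<QuerySolution> rowAction) {
            try (QueryExecution qExec = query(query)) {
                qExec.execSelect().forEachRemaining(rowAction);
            }
        }
    }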
>
> By the way - I added explicit transaction support and some example usage.
>
> Maybe we can use jena-client as a base to work from? If we feel we want to
>> add the separate GSP operations, then I think the extension point would be
>> to add a new GSP interface similar to Updater [5] (but lacking the generic
>> update query functionality).
>>
>
> I have no problem with jena-client as the starting point; I want to
> understand its design first.
>
> I'm not seeing what the separate interfaces and *Statement give the
> application - maybe I'm missing something here - it does seem to make it
> more complicated compared to just performing the operation. For
> *Statement, it's still limited in scope to the connection but can be
> passed out.
>
>
The reason for all the interfaces was to ease different implementations.
There are already the local dataset and remote cases, and as you mention
below, possibly some other non-HTTP case. Additionally, third parties
might want to provide their own implementations.

The main reason for the new QueryStatement and UpdateStatement classes was
that QueryExecution (and similarly UpdateProcessor) has methods that
seemed inappropriate:
* setInitialBinding(QuerySolution) - This is not SPARQL, and furthermore
  only works with local datasets
* getDataset() - Mostly doesn't make sense for remote datasets (because
  of blank nodes). Also, the remote case would have to fetch everything
  eagerly.
* getContext() - Only for local datasets
* getQuery() - This method could make sense to add to QueryStatement,
  though it implies parsing the query client-side

But these are relatively minor reasons.
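For what it's worth, here is a rough sketch of the trimmed-down surface
that remains (my guess at the shape, not a committed API):

    import org.apache.jena.query.ResultSet;
    import org.apache.jena.rdf.model.Model;

    // Rough sketch of a trimmed-down QueryStatement: the local-only
    // methods listed above are simply absent.
    interface QueryStatement extends AutoCloseable {
        ResultSet execSelect();
        Model execConstruct();
        Model execDescribe();
        boolean execAsk();

        @Override
        void close();
    }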
> Please remove the Sesame comments in javadoc and documentation. There's
> no need for javadoc and documentation to comment on another community's
> implementation choices, which can change. If you want to write up the
> reasons then have a blog item somewhere, which makes them more
> time-specific.
>
>
Yep, you are correct; I was trying to be overly helpful. Removed.
> We might want to consider a non-HTTP remote connection; at least design
> for the possibility. My motivation was initially more around working
> with other people's published data (i.e. a long way away, not in the
> same data centre).
>
>
Yeah, that would be a good idea. The HTTP protocol does impose some
annoying limitations, especially with transactions and duplex communication.
-Stephen