Re: RDF Delta - recording changes to RDF Datasets

Stephen Allen Tue, 18 Jun 2013 16:42:22 -0700

On Tue, Jun 18, 2013 at 6:05 PM, Andy Seaborne wrote:


> On 18/06/13 22:13, Rob Vesse wrote:
>
>> Hey Andy
>>
>
> Hi Rob - thanks for the comments - really appreciate feedback -
>
>
>
>> The basic approach looks sound and I like the simple text based format,
>> see my notes later on about maybe having a binary serialization as well.
>>
>
> A binary form would be excellent for this and for NT and NQ.  One of the
> speed limitations is parsing and Turtle is slower than NT (this isn't just
> a Jena effect).  gzip is neutral for reading but slows down writing.  So a
> fast file format would be quite useful to add to the tool box.
>
>
>> How do you envisage incremental backups being implemented in practice, you
>> suggest in the document that you would take a full RDF dump and then
>> compute the RDF delta from a previous backup.  Talking from the experience
>> of having done this as part of one of my experiments in my PhD this can be
>> very complex and time consuming to do especially if you need to take care
>> of BNode isomorphism.  I assume from some of the other discussion on
>> BNodes that you assume that IDs will remain stable across dumps, thus
>> there is an implicit requirement here that the database be able to dump
>> RDF using consistent BNode IDs (either internal IDs or some stable round
>> trippable IDs).  Taking ARQ as an example the existing NQuads/TriG writers
>> do not do this so there would need to be an option for those writers to be
>> able to support this.
>>
>
> Shh, don't tell anyone but n-quads and n-triples outputs do dump
> recoverable bNode labels :-)  TriG and Turtle do not - they try to be
> pretty.  The readers need a tweak to recover them but the label->Node code
> has an option for various label policies and recover id from label is one
> of them.  This is not exposed formally - it's strictly illegal for RDF
> syntaxes.  Or use <_:label> URIs.
>
> I have prototyped a wrapper dataset that records changes as they happen
> driven off add(quad) and delete(quad).  This produces the RDF Delta (sp!)
> form, so couple it to transactions and you can have a "live incremental backup".
>
> A strict after-the-event delta would be prohibitively expensive.
>
>
>> Even without any concerns of BNode isomorphism comparing two RDF dumps to
>> create a delta could be a potentially very time consuming operation and
>> recording the deltas as changes happen may be far more efficient.  Of
>> course depending on the exact use case the RDF dump and compute delta
>> approach may be acceptable.
>>
>
> It isn't a delta in the set theory A\B sense - nor is it a diff (it's not
> reversible without the additional condition).  "delta" and "diff" are both
> names I've toyed with - "RDF changes" might better capture the idea.  Or
> "RDF Changes Log".
>
>
>> My main criticism is on the "Minimise actions" section, there needs to be
>> a more solid clarification of definitions and when minimization can and
>> should happen.
>>
>
> Yes - it isn't as well covered in the doc.
>
> Logically - or generally - in the event-generating dataset wrapper:
>
>         if ( contains(g,s,p,o) ) {
>             record(QuadAction.NO_ADD,g,s,p,o) ; // No action.
>             return ;
>         }
>
>         add(g,s,p,o) ;
>         record(QuadAction.ADD,g,s,p,o) ;        // Action.
>
> https://github.com/afs/AFS-Dev/tree/master/src/main/java/projects/recorder
>
> but implementations like TDB can do it without the contains() as the
> indexes already return true/false for whether a change occurred or not.
>
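
As a rough illustration of the two variants above (a contains() check first, versus a store whose add/delete already reports whether anything changed), here is a minimal sketch. It is not the AFS-Dev recorder code: RecordingQuadSet, its string-based quads and the "A"/"R" log line layout are all assumptions made for the example, and a plain java.util.Set stands in for indexes that return true/false.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a dataset wrapper that logs only real changes. */
final class RecordingQuadSet {
    // Quads are kept as 4-element string lists (g, s, p, o); illustration only.
    private final Set<List<String>> quads = new HashSet<>();
    private final List<String> log = new ArrayList<>();

    /** Adds the quad and logs an "A" line only if the set actually changed. */
    boolean add(String g, String s, String p, String o) {
        boolean changed = quads.add(List.of(g, s, p, o)); // true only if the quad was absent
        if (changed) {
            log.add("A " + g + " " + s + " " + p + " " + o);
        }
        return changed;
    }

    /** Deletes the quad and logs an "R" line only if the set actually changed. */
    boolean delete(String g, String s, String p, String o) {
        boolean changed = quads.remove(List.of(g, s, p, o)); // true only if the quad was present
        if (changed) {
            log.add("R " + g + " " + s + " " + p + " " + o);
        }
        return changed;
    }

    /** Returns a copy of the recorded change lines, oldest first. */
    List<String> log() {
        return new ArrayList<>(log);
    }
}
```

With this shape, adding the same quad twice records a single "A" line, and deleting a quad that is not present records nothing, so no separate contains() call is needed.
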
>
>
>> For example:
>>
>> "When written in minimise form the RDF Delta can be run backwards, to undo
>> a change. This only works when real changes are recorded because otherwise
>> knowing a triple is added does not mean it was not there before."
>>
>> While I agree it is necessary to record real changes for deltas to be
>> reverse applied I'm not convinced they have to be in minimized form (at
>> least based on how the definition of minimized form reads right now), if
>> only real changes are recorded then deltas will be in a minimal form.
>>
>> Yet it is not entirely clear by your definition the following delta would
>> be considered minimal:
>>
>> A <http://s> <http://p> <http://o>
>> R <http://s> <http://p> <http://o>
>> A <http://s> <http://p> <http://o>
>>
>
> If the dataset did not originally contain <http://s> <http://p> <http://o>
> then that is minimal.  Each row makes a real change ; it's the fact that
> graphs/datasets are sets of triples/quads that means the real change is needed.
>
>
>> I'm assuming that your intention was that such deltas should not be
>> minimized but perhaps this needs to be more clear in the document.
>>
>
> There is no reason not to allow the redundant first two A-D to be removed
> but it's not required.
>
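
To make the "run backwards" point concrete, here is a small sketch of undoing a log like Rob's A/R/A example. It assumes one "A <triple>" or "R <triple>" line per change and treats each triple as an opaque string; ChangeLogReplay and both method names are made up for the illustration and are not taken from the RDF Delta document or code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: replays and reverses a log of "A ..." / "R ..." lines. */
final class ChangeLogReplay {
    /** Applies each line in order: "A " adds the triple, "R " removes it. */
    static void apply(Set<String> data, List<String> log) {
        for (String line : log) {
            if (line.startsWith("A ")) {
                data.add(line.substring(2));
            } else if (line.startsWith("R ")) {
                data.remove(line.substring(2));
            } else {
                throw new IllegalArgumentException("Unknown action: " + line);
            }
        }
    }

    /** Builds the undo log: lines in reverse order with A and R swapped. */
    static List<String> reverse(List<String> log) {
        List<String> undo = new ArrayList<>();
        for (int i = log.size() - 1; i >= 0; i--) {
            String line = log.get(i);
            if (line.startsWith("A ")) {
                undo.add("R " + line.substring(2));
            } else if (line.startsWith("R ")) {
                undo.add("A " + line.substring(2));
            } else {
                throw new IllegalArgumentException("Unknown action: " + line);
            }
        }
        return undo;
    }
}
```

Starting from a dataset that does not contain the triple, applying the A/R/A log and then its reverse returns the original state. If the log instead held an "A" for a triple that was already present (not a real change), the reversed log would remove a triple that was there before, which is why only real changes can be undone.
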
>
>> On the topic of related work:
>>
>> I think I may have mentioned previously that I've done some research work
>> internally here at YarcData on a general purpose binary serialization for
>> Triples, Quads and Tuples which likely could be fairly trivially extended
>> to carry a binary encoding of the deltas as well which may save space.
>> For ball park comparison purposes compression is roughly equivalent to
>> GZipping raw NTriples with the key advantage being that the format is
>> significantly faster to process even in its current prototype single
>> threaded implementation (the design was written to take advantage of
>> parallelism).  There are a bunch of further optimizations that I had ideas
>> for that I never got as far as implementing because of lack of management
>> support for the concept.
>>
>
> My experience is that the cost of writing gzip is an appreciable slowdown.
> If your binary form removes that cost it would help full backups quite a
> lot.
>
>
>> There has been some discussion of open sourcing this work (likely as a
>> contributed Experimental module to Jena) so that it could be developed
>> outside of the company, if this sounds like it may be of interest I will
>> broach the subject with relevant management again and see whether this can
>> happen in the near future.
>>
>
> Please do.  I find the style of having a text form and a binary form makes
> system building easier.  Text files to debug; binary for production.
>
> We can add e.g. .ntz and .nqz to the known formats -- modules can add
> languages, syntaxes, parsers and writers.  The JSON-LD module does, so I
> know it does work from outside; all the built-in ones actually register
> themselves the same way and have no specials.
>
>
Rob:

I would definitely be interested in a binary format for both triples and
quads.  In fact, if it could be generalized to handle arbitrarily sized RDF
tuples, that would be even better.  I would like to replace the current
text-based solution used for the spill-to-disk functionality.

Andy:
I like what you've done and think it could be very useful.  One suggestion:
the order of the tuples should be <s> <p> <o> <g> to match the N-Quads
format [1].


-Stephen

[1] http://www.w3.org/TR/n-quads/
