Thanks for sending this out

Another use case that springs to mind is write-ahead logging, particularly 
for reversible patches.

On the subject of prefixes, I agree that being able to record prefix definitions 
is useful, and I am strongly in favour of not using them to compact the data. As 
you say, it actually makes reading and writing the data slower, as well as 
requiring additional state to be recorded during processing.

I like the use of transaction boundaries. I also like A.Soroka’s suggestion of 
applying the reversible flag to the transaction begin rather than to the 
patch as a whole, though I don’t see any problem with supporting both forms. I 
think reversible patches are an essential feature.

For the version control aspect, I would be tempted not to constrain it to a UUID 
and simply say that it is an identifier for the parent state to which the patch 
is applied. This would allow people the freedom to use hash algorithms, 
simple counters, or any other version identification scheme they desire. I 
might even be tempted to suggest that it should be a URI, so that people can use 
identifiers in their own namespaces to reduce the chance of collisions.

I can see the value of supporting metadata about the patch, both within it and 
in any protocol used to communicate it. Checksums are fine, although if you 
include them then you probably need to define exactly how each checksum should 
be calculated.

 As for some of the other suggestions you have received:

- I would be strongly against including an ANY term. As soon as you get into 
wildcards you may as well just use SPARQL Update. Moreover, the meaning of the 
wildcard depends on the dataset to which it is applied, which completely 
defeats the purpose of being a canonical description of changes.
- I am strongly for including the REPEAT term. This has the potential to offer 
significant compression, particularly if the system producing the patch chooses 
to group changes by subject and predicate, à la Turtle and most other syntaxes 
(see the sketch after this list).
- Having a term for the default graph could prove useful.
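
A purely illustrative sketch of the kind of compression the REPEAT term could 
give, assuming "R" means "the same term as the previous row in this position" 
(the subjects, predicates and objects here are made up for the example):

    QA <http://example/book1> <http://example/title> "RDF Patch" .
    QA R <http://example/creator> "Alice" .
    QA R R "Bob" .

The second and third rows reuse the subject (and then also the predicate) of 
the previous row, which is exactly the grouping a Turtle-style writer already 
produces.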

Rob


On 13/10/2016 16:32, "Andy Seaborne" <andy.seabo...@gmail.com on behalf of 
a...@seaborne.org> wrote:

    I've been using modified RDF Patch for the data exchanged to keep 
    multiple datasets synchronized.
    
    My primary use case is having multiple copies of the datasets for a high 
    availability solution.  It has to be a general solution for any data.
    
    There are some changes to the format that this work has highlighted.
    
    [RDF Patch - v1]
    https://afs.github.io/rdf-patch/
    
    
    1/ Record changes to prefixes
    
    Just handling quads/triples isn't enough - to keep two datasets in-step, 
    we also need to record changes to prefixes.  While they don't change the 
    meaning of the data, application developers and users like prefixes.
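    
    For example, a prefix change could be recorded next to the quad changes 
    using the "PA"/"PD" (add/delete prefix) rows from the packet sketch at the 
    end of this message; the argument form shown here is purely illustrative:
    
    PA "ex:" <http://example/> .
    PD "old:" .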
    
    2/ Remove the in-data prefixes feature.
    
    RDF Patch has the feature to define prefixes in the data and use them 
    for prefix names later in the data using @prefix.
    
    This seems to have no real advantage: it can slow things down (cf. 
    N-Triples parsing is faster than Turtle parsing, and prefixes are part 
    of the reason), and it generally complicates the data format.
    
    When combined with recording "add"/"delete" of prefixes on the dataset 
    (point 1), it also becomes quite confusing.
    
    Whether the "R" for "repeat" entry from previous row should also be 
    removed is an open question.
    
    3/ Record transaction boundaries.
    
    (A.3 in RDF Patch v1)
    http://afs.github.io/rdf-patch/#transaction-boundaries
    
    Having the transaction boundaries recorded means that they can be 
    replayed when applying the patch.  While often a patch will be one 
    transaction, patches can be consolidated by concatenation.
    
    There are 3 operations:
    
    TB, TC, TA - Transaction Begin, Commit, Abort.
    
    Abort is useful to include because knowing whether a transaction in a 
    patch is going to commit or abort means waiting until the end.  That 
    could mean buffering client-side, or buffering server-side (or not 
    writing the patch to a file) and having a means to discard a patch stream.
    
    Instead, allow a transaction to record an abort, and say that aborted 
    transactions in patches can be discarded downstream.
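    
    A small illustrative patch stream with two concatenated transactions, the 
    second of which aborts and so can be discarded by anything replaying the 
    stream (the quad terms are placeholders):
    
    TB
    QA <s1> <p1> <o1> .
    QD <s2> <p2> <o2> .
    TC
    TB
    QA <s3> <p3> <o3> .
    TA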
    
    4/ Reversibility is a patch feature.
    
    The RDF Patch v1 document includes "canonical patch" (section 9)
    http://afs.github.io/rdf-patch/#canonical-patches
    
    Such a patch is reversible (it can undo changes) if the adds and deletes 
    are recorded only when they lead to a real change.  "Add quad" must mean 
    "there was no such quad in the set before".  But this only makes sense if 
    the whole patch has this property.
    
    RDF Patches are in general entries in a "redo log" - you can apply the 
    patch over and over again and it will end up in the same state (they are 
    idempotent).
    
    A reversible patch is also an "undo log" entry and if you apply it in 
    reverse order, it acts to undo the patch played forwards.
    
    Testing whether a triple or quad is already present while performing 
    updates is not cheap - and in cases where the patch is being computed 
    without reference to an existing dataset it may not be possible at all.
    
    What would be useful is to label the patch itself to say whether it is 
    reversible.
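    
    For example, if a reversible patch records (terms are placeholders):
    
    TB
    QD <s> <p> "old" .
    QA <s> <p> "new" .
    TC
    
    then its undo is obtained by swapping adds and deletes and playing the 
    rows in reverse order:
    
    TB
    QD <s> <p> "new" .
    QA <s> <p> "old" .
    TC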
    
    5/ "RDF Git"
    
    A patch should be able to record where it can be applied.  If RDF Patch 
    is being used to keep two datasets in-step, then some checking is needed 
    to know that the patch can be applied to a copy, because it is a patch 
    created from the previous version of that data.
    
    So give each version of the dataset a UUID, then record the old 
    ("parent") UUID and the new UUID in the patch.
    
    If the version is checked and enforced, we get a chain of versions and 
    patches that leads from one state to another without risk of concurrent 
    changes getting mixed in.
    
    This is like git: a patch can be accepted if the versions align, 
    otherwise it is rejected (more like a git repo not accepting a push than 
    a merge conflict).
    
    Or some system may want to apply any patch and so create a tree of 
    changes.  For the use case of keeping two datasets in-step, that's not 
    what is wanted, but other use cases may be better served by having the 
    primary version chain sorted out by higher-level software; a patch may 
    be a "proposed change".
    
    6/ Packets of change.
    
    To have 4 (label a patch as reversible) and 5 (the version details), 
    there needs to be somewhere to put the information.  Having it in the 
    patch itself means that the whole unit can be stored in a file.  If it 
    is only in the protocol, like ETags in HTTP, then the information becomes 
    separated.  That is not to say that it can't also be in the protocol, but 
    it needs support in the data format.
    
    7/ Checksum
    
    Another feature to add to the packet is a checksum.  A hash (which one? 
    git uses SHA-1) over the packet, covering the header with the initial 
    version (UUID) and the version after applying the patch (UUID), plus the 
    changes (i.e. from the start of the packet to just after the DOT of the 
    last line of change), makes the packet robust to editing after it is 
    created.  This is like git, which uses such a hash as the "object id".
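    
    A minimal Java sketch of computing such a hash over the packet text, up 
    to but not including the final "H" line (the exact byte coverage would 
    have to be pinned down in the spec; this is just one plausible reading):
    
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    
    class PacketHash {
        // SHA-1 over the packet bytes from the start of the header to the end
        // of the last change line, returned as lowercase hex for the H line.
        static String sha1Hex(String packetWithoutHashLine) throws NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(packetWithoutHashLine.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest)
                hex.append(String.format("%02x", b));
            return hex.toString();
        }
    }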
    
    So a patch packet for a single transaction:
    
    PARENT <UUID>
    VERSION <UUID>
    REVERSIBLE           optional
    TB
    QA ...
    QD ...
    PA ...
    PD ...
    TC
    H <sha1sum>
    
    where QA and QD are "quad add" and "quad delete", and PA and PD are 
    "add prefix" and "delete prefix".
    
        Andy
    
    
    [RDF Patch - v1]
    https://afs.github.io/rdf-patch/
    
    RDF Patch - updated library
    work in progress (does not have "packets").
    
    https://github.com/afs/rdf-delta/tree/master/rdf-patch
    



