Re: RDF Patch - experiences suggesting changes

Paul Houle Thu, 13 Oct 2016 09:02:58 -0700

There is another use case for an "RDF Patch" which applies to
hand-written models.  For instance I have a model which describes a job
that is run in AWS that looks like


@prefix : <http://rdf.ontology2.com/henson/aws/> .
@prefix parameter: <http://rdf.ontology2.com/henson/aws/parameter/> .

:Server
   :subnetId "subnet-e0ab0197";
   :baseImage "ami-ea602afd";
   :instanceType "r3.xlarge";
   :keyName "o2key";
   :keyFile "~/AMZN Keys/o2key.ppk" ;
   :securityGroupIds "sg-bca0b2d9" ;
   :todo "dbpedia-load" ;
   parameter:RDF_SOURCE "s3://abnes/dbpedia/2015-10-gz/" ;
   parameter:GRAPH_NAME "http://dbpedia.org/" ;
   :imageCommand "/home/ubuntu/RDFeasy/bin/shred_evidence_and_halt";
   :iamProfile <arn:aws:iam::181667415011:instance-profile/Marcabian> ;
   :instanceName "Image Build Server";
   :qBase <https://sqs.us-east-1.amazonaws.com/181667415011/> .

One thing you might want to do is modify it to use a different
:baseImage or a different :instanceType, and a natural way to do that is
to say

'remove :Server :instanceType ?x and insert :Server :instanceType
"r3.2xlarge"'

Better still, if you have a schema that says ":instanceType is a
single-valued property", you can write another graph like

:Server
   :instanceType "r3.2xlarge" .

and merge it with the first graph to get the desired effect.
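
A rough sketch of that merge, assuming triples are held as plain
(subject, predicate, object) tuples and the schema is reduced to a set of
single-valued predicates. The function name and the `single_valued`
parameter are made up for illustration; no particular RDF library is
implied.

```python
def merge_with_overrides(base, overlay, single_valued):
    """Union two graphs, letting the overlay win for single-valued predicates."""
    # (subject, predicate) pairs for which the overlay supplies the one allowed value
    overridden = {(s, p) for (s, p, o) in overlay if p in single_valued}
    # keep every base triple that is not displaced by the overlay
    kept = {(s, p, o) for (s, p, o) in base if (s, p) not in overridden}
    return kept | set(overlay)


base = {
    (":Server", ":instanceType", "r3.xlarge"),
    (":Server", ":baseImage", "ami-ea602afd"),
    (":Server", ":keyName", "o2key"),
}
overlay = {(":Server", ":instanceType", "r3.2xlarge")}

merged = merge_with_overrides(
    base, overlay, single_valued={":instanceType", ":baseImage"}
)
```

Without the schema, the same merge would be a plain union and :Server
would end up with two :instanceType values.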

More generally this fits into the theme that "the structure of
commonsense knowledge is that there are rules,  then exceptions to the
rules,  then exceptions to the exceptions of the rules,  etc."

For instance, I extracted a geospatial database of about 10 million
facts out of Freebase, and I found I had to add and remove about 10
facts on the route to a 99% success rate at a geospatial recognition
task.  A disciplined approach to "agreeing to disagree" goes a long way
toward solving the problem that specific applications require us to
split hairs in different ways.



-- 
  Paul Houle
  [email protected]

On Thu, Oct 13, 2016, at 11:32 AM, Andy Seaborne wrote:
> I've been using modified RDF Patch for the data exchanged to keep 
> multiple datasets synchronized.
> 
> My primary use case is having multiple copies of the datasets for a high 
> availability solution.  It has to be a general solution for any data.
> 
> There are some changes to the format that this work has highlighted.
> 
> [RDF Patch - v1]
> https://afs.github.io/rdf-patch/
> 
> 
> 1/ Record changes to prefixes
> 
> Just handling quads/triples isn't enough - to keep two datasets in-step, 
> we also need to record changes to prefixes.  While they don't change the 
> meaning of the data, application developers and users like prefixes.
> 
> 2/ Remove the in-data prefixes feature.
> 
> RDF Patch has the feature to define prefixes in the data and use them 
> for prefix names later in the data using @prefix.
> 
> This seems to have no real advantage, it can slow things down (cf. 
> N-Triples parsing is faster than Turtle parsing - prefixes are part of 
> that), and it generally complicates the data form.
> 
> When including "add"/"delete" prefixes on the dataset (1) it also makes 
> it quite confusing.
> 
> Whether the "R" for "repeat" entry from previous row should also be 
> removed is an open question.
> 
> 3/ Record transaction boundaries.
> 
> (A.3 in RDF Patch v1)
> http://afs.github.io/rdf-patch/#transaction-boundaries
> 
> Having the transaction boundaries recorded means that they can be 
> replayed when applying the patch.  While often a patch will be one 
> transaction, patches can be consolidated by concatenation.
> 
> There are 3 operations:
> 
> TB, TC, TA - Transaction Begin, Commit, Abort.
> 
> Abort is useful to include because to know whether a transaction in a 
> patch is going to commit or abort means waiting until the end.  That 
> could be buffering client-side, or buffering server-side (or not writing 
> the patch to a file) and having a means to discard a patch stream.
> 
> Instead, allow a transaction to record an abort, and say that aborted 
> transactions in patches can be discarded downstream.
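
Discarding aborted transactions downstream could look roughly like the
sketch below. It assumes one patch row per line with the operation code
as the first token; `drop_aborted` is a hypothetical helper, not part of
any RDF Patch library.

```python
def drop_aborted(lines):
    """Yield patch lines, omitting every transaction that ends in TA."""
    pending = None  # rows of the currently open transaction, None outside one
    for line in lines:
        op = line.split(None, 1)[0] if line.strip() else ""
        if op == "TB":
            pending = [line]
        elif pending is None:
            yield line               # row outside any transaction: pass through
        elif op == "TC":
            pending.append(line)
            yield from pending       # committed: release the buffered rows
            pending = None
        elif op == "TA":
            pending = None           # aborted: drop the buffered rows
        else:
            pending.append(line)


patch = [
    "TB", "QA <s> <p> <o> <g> .", "TC",
    "TB", "QA <s> <p> <x> <g> .", "TA",
    "TB", "QD <s> <p> <o> <g> .", "TC",
]
kept = list(drop_aborted(patch))
```

The buffering happens in the consumer here, so the producer can stream
rows as they occur. A transaction still open at end of input is dropped
in this sketch.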
> 
> 4/ Reversibility is a patch feature.
> 
> The RDF Patch v1 document includes "canonical patch" (section 9)
> http://afs.github.io/rdf-patch/#canonical-patches
> 
> Such a patch is reversible (it can undo changes) if the adds and deletes 
> are recorded only if they lead to a real change.  "Add quad" must mean 
> "there was no quad in the set before".  But this only makes sense if the 
> whole patch has this property.
> 
> RDF Patches are in general entries in a "redo log" - you can apply the 
> patch over and over again and it will end up in the same state (they are 
> idempotent).
> 
> A reversible patch is also an "undo log" entry and if you apply it in 
> reverse order, it acts to undo the patch played forwards.
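
To illustrate the redo/undo distinction, here is a small sketch that
treats a dataset as a set of quads and a patch as a list of (op, quad)
pairs. The names `apply_changes` and `invert` are invented for the
example, and prefix rows are left out.

```python
SWAP = {"QA": "QD", "QD": "QA"}


def apply_changes(quads, changes):
    """Replay QA/QD rows over a copy of the quad set."""
    result = set(quads)
    for op, quad in changes:
        if op == "QA":
            result.add(quad)
        elif op == "QD":
            result.discard(quad)
    return result


def invert(changes):
    """Build the undo patch: reverse the row order and swap add with delete."""
    return [(SWAP[op], quad) for op, quad in reversed(changes)]


q1 = ("<s>", "<p>", '"old"', "<g>")
q2 = ("<s>", "<p>", '"new"', "<g>")
before = {q1}

# Every row is a real change, so the patch is reversible.
canonical = [("QD", q1), ("QA", q2)]
after = apply_changes(before, canonical)
restored = apply_changes(after, invert(canonical))

# q1 is already present, so this add is not a real change.
redundant = [("QA", q1)]
lossy = apply_changes(apply_changes(before, redundant), invert(redundant))
```

Undoing the redundant patch deletes q1 even though the patch never
really added it, which is why the reversible property only makes sense
for the patch as a whole.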
> 
> Testing whether a triple or quad is already present while performing 
> updates is not cheap - and in some cases where the patch is being 
> computed without reference to an existing dataset may not be possible.
> 
> What would be useful is to label the patch itself to say whether it is 
> reversible.
> 
> 5/ "RDF Git"
> 
> A patch should be able to record where it can be applied.  If RDF Patch 
> is being used to keep two datasets in-step, then some checking is needed 
> to know that the patch can be applied to a copy because it is a patch 
> created from the previous version.
> 
> So give each version of the dataset a UUID for a version then record the 
> old ("parent") UUID and the new UUID in the patch.
> 
> If the version is checked and enforced, we get a chain of versions and 
> patches that lead from one state to another without risk of concurrent 
> changes getting mixed in.
> 
> This is like git - a patch can be accepted if the versions align 
> otherwise it is rejected (more a git repo not accepting a push than a 
> merge conflict).
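
A sketch of that check, assuming each copy remembers the UUID of the
version it currently holds. `Replica`, `VersionMismatch` and `accept`
are hypothetical names used only for this example.

```python
import uuid


class VersionMismatch(Exception):
    """The patch was not created from the version this copy holds."""


class Replica:
    def __init__(self, version):
        self.version = version
        self.quads = set()

    def accept(self, parent, version, changes):
        # Reject before touching any data, like a repo refusing a push.
        if parent != self.version:
            raise VersionMismatch(f"have {self.version}, patch expects {parent}")
        for op, quad in changes:
            if op == "QA":
                self.quads.add(quad)
            elif op == "QD":
                self.quads.discard(quad)
        self.version = version


v0, v1 = uuid.uuid4(), uuid.uuid4()
replica = Replica(v0)
replica.accept(v0, v1, [("QA", ("<s>", "<p>", "<o>", "<g>"))])

try:
    replica.accept(v0, uuid.uuid4(), [("QD", ("<s>", "<p>", "<o>", "<g>"))])
    rejected = False
except VersionMismatch:
    rejected = True  # a second patch built on v0 no longer lines up
```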
> 
> Or some system may want to apply any patch and so create a tree of 
> changes.  For the use case of keeping two datasets in-step, that's not 
> what is wanted but other use cases may be better served by having the 
> primary version chain sorted out by higher level software; a patch may 
> be a "proposed change".
> 
> 6/ Packets of change.
> 
> To have 4 (label a patch as reversible) and 5 (the version details), 
> there needs to be somewhere to put the information. Having it in the 
> patch itself means that the whole unit can be stored in a file.  If it 
> is in the protocol, like HTTP for E-tags then the information becomes 
> separated.  That is not to say that it can't also be in the protocol but 
> it needs support in the data format.
> 
> 7/ Checksum
> 
> Another feature to add to the packet is a checksum. A hash (which one? 
> git uses SHA1) from start of packet header, including the initial 
> version (UUID), the version on applying the patch (UUID) and the changes 
> (i.e. start of packet to after the DOT of the last line of change), 
> makes the packet robust to editing after creating it.   Like git; git 
> uses it as the "object id".
> 
> So a patch packet for a single transaction:
> 
> PARENT <UUID>
> VERSION <UUID>
> REVERSIBLE           optional
> TB
> QA ...
> QD ...
> PA ...
> PD ...
> TC
> H <sha1sum>
> 
> where QA and QD are "quad add" "quad delete", and "PA" "PD" are "add 
> prefix" and "delete prefix"
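
A sketch of sealing and checking such a packet with SHA-1 from the
Python standard library. The exact byte range, the "\n" line endings,
the bare hex digest after "H", and the placeholder UUIDs are all
assumptions; the proposal above does not pin them down.

```python
import hashlib


def seal(lines):
    """Join the packet rows and append an 'H <sha1 hex>' trailer over them."""
    body = "\n".join(lines) + "\n"
    digest = hashlib.sha1(body.encode("utf-8")).hexdigest()
    return body + f"H {digest}\n"


def verify(packet):
    """Recompute the hash over everything before the final H row."""
    rows = packet.splitlines(keepends=True)
    if not rows or not rows[-1].startswith("H "):
        return False
    body = "".join(rows[:-1])
    claimed = rows[-1][2:].strip()
    return hashlib.sha1(body.encode("utf-8")).hexdigest() == claimed


packet = seal([
    "PARENT <urn:uuid:00000000-0000-0000-0000-000000000001>",   # placeholder
    "VERSION <urn:uuid:00000000-0000-0000-0000-000000000002>",  # placeholder
    "REVERSIBLE",
    "TB",
    "QA <s> <p> <o> <g> .",
    "TC",
])
intact = verify(packet)
tampered = verify(packet.replace("QA", "QD", 1))
```

Because the header rows are inside the hashed range, changing either
UUID after the fact invalidates the packet just as editing a change row
does.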
> 
>       Andy
> 
> 
> [RDF Patch - v1]
> https://afs.github.io/rdf-patch/
> 
> RDF Patch - updated library
> work in progress (does not have "packets").
> 
> https://github.com/afs/rdf-delta/tree/master/rdf-patch
