Thoughts in-line. (Incidentally, my immediate interest in RDF Patch is pretty
similar: robustness via distribution. But there's also a smaller, more
theoretical interest for me in automatically "shredding" or "sharding" datasets
across networks for higher persistence and query throughput.)
The University of Virginia Library
> On Oct 13, 2016, at 11:32 AM, Andy Seaborne <a...@seaborne.org> wrote:
> 1/ Record changes to prefixes
> Just handling quads/triples isn't enough - to keep two datasets in-step, we
> also need to record changes to prefixes. While they don't change the meaning
> of the data, application developers and users like prefixes.
Boo, hiss, but I can see your point. The worry for me is the inevitable
semantic overloading that will come with it. But I guess that cake has already
been baked by every other RDF format except N-Triples.
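To make the bookkeeping concrete: keeping two datasets' prefix maps in step reduces to replaying "add prefix" / "delete prefix" rows against a map. A minimal Python sketch, assuming the PA/PD operations proposed later in this thread (the tuple layout is my own illustration, not the actual grammar):

```python
# Replay prefix changes against a dict mapping prefix name -> IRI.
# Row shapes are illustrative: ("PA", name, iri) adds, ("PD", name) deletes.

def apply_prefix_ops(prefixes, rows):
    for op, name, *iri in rows:
        if op == "PA":
            prefixes[name] = iri[0]          # add or overwrite the mapping
        elif op == "PD":
            prefixes.pop(name, None)         # delete; ignore if absent
    return prefixes

p = apply_prefix_ops({}, [
    ("PA", "foaf:", "<http://xmlns.com/foaf/0.1/>"),
    ("PA", "ex:", "<http://example.org/>"),
    ("PD", "ex:"),
])
print(p)
```

Since prefixes don't change the meaning of the data, a consumer that doesn't care about them can simply skip PA/PD rows.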
> 2/ Remove the in-data prefixes feature.
> RDF Patch has the feature to define prefixes in the data and use them for
> prefix names later in the data using @prefix.
> This seems to have no real advantage, it can slow things down (cf. N-Triples
> parsing is faster than Turtle parsing - prefixes are part of that), and it
> generally complicates the data form.
> When including "add"/"delete" prefixes on the dataset (1) it also makes it
> quite confusing.
> Whether the "R" for "repeat" entry from previous row should also be removed
> is an open question.
I would agree with removing R, because it doesn't remove lines: the
abbreviation it offers is pretty minimal. On the other hand, it is relatively
cheap to implement (4 slots of state), so I wouldn't argue very hard for
removing it.
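For concreteness, the "4 slots of state" amount to remembering the previous row's graph, subject, predicate, and object. A minimal Python sketch of expanding R entries, with an illustrative row layout rather than the actual grammar:

```python
# Expand "R" ("repeat") entries: each R takes the value of the same column
# in the previous row. One slot of state per column (G, S, P, O).

def expand_repeats(rows):
    prev = [None, None, None, None]   # G, S, P, O from the previous row
    for op, *terms in rows:
        expanded = [prev[i] if t == "R" else t for i, t in enumerate(terms)]
        prev = expanded
        yield (op, *expanded)

rows = [
    ("QA", "<g>", "<s>", "<p>", '"a"'),
    ("QA", "R", "R", "R", '"b"'),     # repeat graph, subject, predicate
]
print(list(expand_repeats(rows)))
```

As the one-liner suggests, the implementation cost really is tiny; the question is whether the saved bytes justify the extra branch in every parser.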
> 3/ Record transaction boundaries.
> (A.3 in RDF Patch v1)
> Having the transaction boundaries recorded means that they can be replayed
> when applying the patch. While often a patch will be one transaction,
> patches can be consolidated by concatenation.
> There are 3 operations:
> TB, TC, TA - Transaction Begin, Commit, Abort.
> Abort is useful to include because to know whether a transaction in a patch
> is going to commit or abort means waiting until the end. That could be
> buffering client-side, or buffering server-side (or not writing the patch to
> a file) and having a means to discard a patch stream.
> Instead, allow a transaction to record an abort, and say that aborted
> transactions in patches can be discarded downstream.
This is very good stuff. It would be nice to include a definition of
"transaction-compact" in which no TA may appear. It would enable RDF Patch
readers to make a very convenient assumption.
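The downstream discard you describe is cheap to sketch: buffer rows from TB, drop the buffer on TA, and emit it on TC. The output of such a filter is exactly "transaction-compact" in the sense I mean (no TA can appear). A Python sketch with illustrative row tuples, assuming well-formed input:

```python
# Filter a patch stream so that aborted transactions are discarded.
# Rows are tuples whose first element is the operation code.

def compact_transactions(rows):
    buffer = None
    for row in rows:
        op = row[0]
        if op == "TB":
            buffer = [row]           # start buffering a new transaction
        elif op == "TA":
            buffer = None            # abort: drop the whole transaction
        elif op == "TC":
            buffer.append(row)
            yield from buffer        # commit: emit the buffered rows
            buffer = None
        elif buffer is not None:
            buffer.append(row)       # ordinary row inside a transaction

stream = [
    ("TB",), ("QA", "<s>", "<p>", '"x"'), ("TA",),   # aborted: discarded
    ("TB",), ("QA", "<s>", "<p>", '"y"'), ("TC",),
]
print(list(compact_transactions(stream)))
```

A reader consuming the output of this filter can safely assume every TB it sees will reach a TC.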
> 4/ Reversibility is a patch feature.
> The RDF Patch v1 document includes "canonical patch" (section 9)
> Such a patch is reversible (it can undo changes) if the adds and deletes are
> recorded only if they lead to a real change. "Add quad" must mean "there was
> no quad in the set before". But this only makes sense if the whole patch has
> this property.
> What would be useful is to label the patch itself to say whether it is
> reversible.
Just a thought: you could change BEGIN to permit "flags". So you could have:

    TB REVERSIBLE

and you get "canonicity" on a per-transaction level. A patch could optionally
make explicit its wrapping BEGIN and END for this kind of use.
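And once a transaction (or whole patch) carries that flag, undoing it is mechanical: reverse the row order and swap each add for its corresponding delete. A Python sketch with made-up row tuples; transaction markers are passed through unchanged for simplicity:

```python
# Reverse a REVERSIBLE patch: each QA really added a quad and each QD
# really removed one, so swapping the operations and reversing the order
# yields the undo patch.

INVERSE = {"QA": "QD", "QD": "QA", "PA": "PD", "PD": "PA"}

def reverse_patch(rows):
    return [(INVERSE.get(op, op), *rest) for op, *rest in reversed(rows)]

patch = [("QA", "<s>", "<p>", '"new"'), ("QD", "<s>", "<p>", '"old"')]
print(reverse_patch(patch))
```

This only works because of the canonical property: if an add could be a no-op, its inverse delete would remove a quad that was there all along.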
> 5/ "RDF Git"
> A patch should be able to record where it can be applied. If RDF Patch is
> being used to keep two datasets in-step, then some checking to know that the
> patch can be applied to a copy because it is a patch created from the
> previous version
> So give each version of the dataset a UUID for a version then record the old
> ("parent") UUID and the new UUID in the patch.
> Or some system may want to apply any patch and so create a tree of changes.
> For the use case of keeping two datasets in-step, that's not what is wanted
> but other use cases may be better served by having the primary version chain
> sorted out by higher level software; a patch may be a "proposed change".
Yes, the roaring success of Git (and other DVCS) may imply that letting patches
be pure changes (not connected to particular versions of the dataset, just
"isolated" deltas) is the right way to think about them. The word "patch",
itself, is usefully suggestive. That doesn't mean avoiding any versioning info,
just making clear that datasets have versions, and the UUIDs associated with a
given patch refer to where it _came from_, but you can still apply it to
whatever you want (like cherry-picking Git commits).
Or another way to think about it: any dataset is just the sum of a series of
patches (a random dataset with no history has an implicit history of one
"virtual" patch with nothing but adds). So those UUIDs are roughly equivalent
to a series of some patch IDs. So I _think_ you could alternatively assign just
patch IDs and record a "parent" patch ID and a "self" patch ID for each patch.
Then the question "Am I supposed to be able to use this patch on this dataset?"
is answerable if you know the patch ID of the last patch applied. Not too
different from dataset version UUIDs but it avoids introducing the notion of
dataset version in favor of "pure changes".
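Under that scheme, the only state a dataset needs is the ID of the last patch applied. A hypothetical Python sketch (the PatchLog class and the patch dict layout are my own illustration, not anything from the draft):

```python
import uuid

class PatchLog:
    """A dataset that only tracks the ID of the last patch applied.
    A patch applies only if its recorded parent matches that head;
    its own ID then becomes the new head."""

    def __init__(self):
        self.head = None   # no patches applied yet

    def apply(self, patch):
        if patch["parent"] != self.head:
            raise ValueError("patch does not apply to this dataset version")
        # ... apply the quad/prefix rows here ...
        self.head = patch["id"]

log = PatchLog()
p1 = {"id": str(uuid.uuid4()), "parent": None}
p2 = {"id": str(uuid.uuid4()), "parent": p1["id"]}
log.apply(p1)
log.apply(p2)
print(log.head == p2["id"])
```

A system that wants the Git-like tree of changes would simply skip the parent check (cherry-picking); the in-step replication use case keeps it strict.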
> 6/ Packets of change.
> To have 4 (label a patch with reversible) and 5 (the version details), there
> needs to be somewhere to put the information. Having it in the patch itself
> means that the whole unit can be stored in a file. If it is in the protocol,
> like HTTP ETags, then the information becomes separated. That is not to say
> that it can't also be in the protocol, but it needs support in the data
> format.
As long as the sort of information we're thinking about makes sense on a
per-transaction basis, it could go where I suggest above: as "metadata" on
BEGIN.
> So a patch packet for a single transaction:
> PARENT <UUID>
> VERSION <UUID>
> REVERSIBLE optional
> QA ...
> QD ...
> PA ...
> PD ...
> H <sha1sum>
> where QA and QD are "quad add" "quad delete", and "PA" "PD" are "add prefix"
> and "delete prefix"
I'm suggesting something more like:
TB PARENT <UUID> VERSION <UUID> REVERSIBLE
TC H <sha1sum>
Or even just positionally:
TB <UUID> <UUID> REVERSIBLE
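Parsing such a flagged TB line is straightforward if keyword flags (PARENT, VERSION) take one argument and bare flags (REVERSIBLE) take none. A Python sketch of that illustrative syntax; this is my proposal's shape, not the actual RDF Patch grammar:

```python
# Parse a "TB <flags...>" line into a dict. Keyword flags consume the
# following token as their value; bare flags are recorded as True.

KEYWORD_FLAGS = {"PARENT", "VERSION"}

def parse_tb(line):
    tokens = line.split()
    assert tokens[0] == "TB"
    flags, i = {}, 1
    while i < len(tokens):
        if tokens[i] in KEYWORD_FLAGS:
            flags[tokens[i]] = tokens[i + 1]
            i += 2
        else:
            flags[tokens[i]] = True
            i += 1
    return flags

print(parse_tb("TB PARENT <uuid-1> VERSION <uuid-2> REVERSIBLE"))
```

The purely positional variant (`TB <UUID> <UUID> REVERSIBLE`) is even cheaper to parse, at the cost of making optional fields awkward.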
I'll add a further point that isn't in response to your thoughts:
You have a section:
> Binary Format
> An alternative wire format for efficient processing.
> (Need to quantify the gains, if any).
You might consider getting rid of R (or even ANY) and just concentrating on
extreme clarity and speed of parsing for the basic format, leaving all
abbreviation to an additional binary format that offers compactness. If that
doesn't make sense, the real point I'm offering is that you have two values in
hand: parsing efficiency and compactness. It might be difficult to balance both
in a basic form and in a binary form and still offer any real advantage to
using binary. But if you separate the two values, it might clarify for the user
the decision of when to use basic or binary. Maybe not. Just a thought...