The format already allows arbitrarily sized tuples (well, in its current form it is capped at 255 columns per tuple), though it assumes it will be used to convey SPARQL results and so currently requires that column headers be provided. Both of those restrictions would be fairly easy to remove.
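For a sense of where the 255 cap comes from: a single unsigned byte for the column count gives exactly that limit. A purely hypothetical sketch of such a header follows - the YarcData format itself is not public, so the layout and names here are invented:

    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    // Hypothetical header layout: one unsigned byte for the column count
    // (hence the 255-column cap), followed by the currently-required
    // column headers. Invented for illustration; not the actual format.
    public class TupleHeaderSketch {
        static void writeHeader(DataOutputStream out, String[] columns) throws IOException {
            if (columns.length < 1 || columns.length > 255)
                throw new IllegalArgumentException("1..255 columns per tuple");
            out.writeByte(columns.length);   // low 8 bits written: 1..255
            for (String name : columns)
                out.writeUTF(name);          // headers are currently mandatory
        }

        public static void main(String[] args) throws IOException {
            try (DataOutputStream out = new DataOutputStream(new FileOutputStream("results.tuples"))) {
                writeHeader(out, new String[] { "s", "p", "o" });
            }
        }
    }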
I will raise the issue of open sourcing with management again and see if I get any traction.

On the subject of column ordering, I can see benefits of putting the <g> field first in that it may make it easier to batch operations on a single graph, though I don't think putting it at the end to align with NQuads precludes this; you just require slightly more lookahead to determine whether to continue adding statements to your batch (sketched in code further down).

Rob

On 6/18/13 4:41 PM, "Stephen Allen" <[email protected]> wrote:

> On Tue, Jun 18, 2013 at 6:05 PM, Andy Seaborne <[email protected]> wrote:
>
>> On 18/06/13 22:13, Rob Vesse wrote:
>>
>>> Hey Andy
>>
>> Hi Rob - thanks for the comments - really appreciate feedback -
>>
>>> The basic approach looks sound and I like the simple text-based format; see my notes later on about maybe having a binary serialization as well.
>>
>> A binary form would be excellent for this and for NT and NQ. One of the speed limitations is parsing, and Turtle is slower than NT (this isn't just a Jena effect). gzip is neutral for reading but slows down writing. So a fast file format would be quite useful to add to the tool box.
>>
>>> How do you envisage incremental backups being implemented in practice? You suggest in the document that you would take a full RDF dump and then compute the RDF delta from a previous backup. Speaking from the experience of having done this as part of one of my experiments in my PhD, this can be very complex and time-consuming, especially if you need to take care of BNode isomorphism. I assume from some of the other discussion on BNodes that you expect IDs to remain stable across dumps; thus there is an implicit requirement here that the database be able to dump RDF using consistent BNode IDs (either internal IDs or some stable round-trippable IDs). Taking ARQ as an example, the existing NQuads/TriG writers do not do this, so there would need to be an option for those writers to support this.
>>
>> Shh, don't tell anyone, but the n-quads and n-triples outputs do dump recoverable bNode labels :-) TriG and Turtle do not - they try to be pretty. The readers need a tweak to recover them, but the label->Node code has an option for various label policies and recover-id-from-label is one of them. This is not exposed formally - it's strictly illegal for RDF syntaxes. Or use <_:label> URIs.
>>
>> I have prototyped a wrapper dataset that records changes as they happen, driven off add(quad) and delete(quad). This produces the RDF Delta (sp!) form, so couple it to the transaction and you can have a "live incremental backup".
>>
>> A strict after-the-event delta would be prohibitively expensive.
>>
>>> Even without any concerns of BNode isomorphism, comparing two RDF dumps to create a delta could be a very time-consuming operation, and recording the deltas as changes happen may be far more efficient. Of course, depending on the exact use case, the dump-and-compute-delta approach may be acceptable.
>>
>> It isn't a delta in the set-theory A\B sense - nor is it a diff (it's not reversible without the additional condition). "delta" and "diff" are both names I've toyed with - "RDF changes" might better capture the idea. Or "RDF Changes Log".
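Picking up Rob's point about <g>-last ordering: below is a minimal sketch, in plain Java with invented types (nothing here is from Jena or the proposal), of batching per graph with one quad of lookahead. With <g> last, the graph is only known once the whole quad has been parsed, so the decision to flush the current batch trails the parse by one quad.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Iterator;
    import java.util.List;

    // Illustrative only: batching quads per graph when <g> is the last field
    // (N-Quads order). The flush decision is made one quad "late" -- the
    // "slightly more lookahead" described above.
    public class GraphBatcher {
        // Minimal stand-in for a parsed quad; not a real Jena type.
        static final class Quad {
            final String g, s, p, o;
            Quad(String g, String s, String p, String o) { this.g = g; this.s = s; this.p = p; this.o = o; }
        }

        static void processInGraphBatches(Iterator<Quad> quads) {
            List<Quad> batch = new ArrayList<>();
            String currentGraph = null;
            while (quads.hasNext()) {
                Quad q = quads.next();                  // parse first, decide after
                if (currentGraph != null && !currentGraph.equals(q.g)) {
                    flush(currentGraph, batch);         // graph changed: emit finished batch
                    batch.clear();
                }
                currentGraph = q.g;
                batch.add(q);
            }
            if (!batch.isEmpty())
                flush(currentGraph, batch);
        }

        static void flush(String graph, List<Quad> batch) {
            System.out.println(batch.size() + " quad(s) -> graph " + graph);
        }

        public static void main(String[] args) {
            processInGraphBatches(Arrays.asList(
                new Quad("g1", "s1", "p", "o"),
                new Quad("g1", "s2", "p", "o"),
                new Quad("g2", "s3", "p", "o")).iterator());
        }
    }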
>>> My main criticism is on the "Minimise actions" section; there needs to be a more solid clarification of definitions and of when minimization can and should happen.
>>
>> Yes - it isn't as well covered in the doc.
>>
>> Logically - or generally - in the event-generating dataset wrapper:
>>
>>   if ( contains(g,s,p,o) ) {
>>       record(QuadAction.NO_ADD, g,s,p,o) ;   // Quad already present: no action.
>>       return ;
>>   }
>>   add(g,s,p,o) ;
>>   record(QuadAction.ADD, g,s,p,o) ;          // Real change: action.
>>
>> https://github.com/afs/AFS-Dev/tree/master/src/main/java/projects/recorder
>>
>> but implementations like TDB can do it without the contains() as the indexes already return true/false for whether a change occurred or not.
>>
>>> For example:
>>>
>>> "When written in minimise form the RDF Delta can be run backwards, to undo a change. This only works when real changes are recorded because otherwise knowing a triple is added does not mean it was not there before."
>>>
>>> While I agree it is necessary to record real changes for deltas to be reverse-applied, I'm not convinced they have to be in minimized form (at least based on how the definition of minimized form reads right now); if only real changes are recorded then deltas will be in a minimal form.
>>>
>>> Yet it is not entirely clear by your definition whether the following delta would be considered minimal:
>>>
>>> A <http://s> <http://p> <http://o>
>>> R <http://s> <http://p> <http://o>
>>> A <http://s> <http://p> <http://o>
>>
>> If the dataset did not originally contain <http://s> <http://p> <http://o> then that is minimal. Each row makes a real change; it's the fact that graphs/datasets are sets of triples/quads that makes recording real changes necessary.
>>
>>> I'm assuming that your intention was that such deltas should not be minimized, but perhaps this needs to be made clearer in the document.
>>
>> There is no reason not to allow the redundant first two rows (the A/R pair) to be removed, but it's not required.
>>
>>> On the topic of related work:
>>>
>>> I think I may have mentioned previously that I've done some research work internally here at YarcData on a general-purpose binary serialization for Triples, Quads and Tuples, which likely could be fairly trivially extended to carry a binary encoding of the deltas as well, which may save space. For ball-park comparison purposes, compression is roughly equivalent to gzipping raw NTriples, with the key advantage being that the format is significantly faster to process even in its current prototype single-threaded implementation (the design was written to take advantage of parallelism). There are a bunch of further optimizations that I had ideas for but never got as far as implementing because of lack of management support for the concept.
>>
>> My experience is that the cost of writing gzip is an appreciable slowdown. If your binary form removes that cost it would help full backups quite a lot.
>>
>>> There has been some discussion of open-sourcing this work (likely as a contributed Experimental module to Jena) so that it could be developed outside of the company; if this sounds like it may be of interest I will broach the subject with relevant management again and see whether this can happen in the near future.
>>
>> Please do. I find the style of having a text form and a binary form makes system building easier. Text files to debug; binary for production.
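The undo behaviour described in the minimise discussion above can be sketched directly. This assumes a line-oriented delta of "A" (add) / "R" (remove) rows in which every row records a real change; only then does swapping A and R and replaying in reverse order give a correct undo. The names below are illustrative, not from the proposal:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Sketch: reverse-applying a delta in which every row is a real change.
    // "A <s> <p> <o>" becomes "R <s> <p> <o>" and vice versa, and the rows
    // are replayed in reverse order so later changes are undone first.
    public class DeltaUndo {
        static List<String> invert(List<String> delta) {
            List<String> undo = new ArrayList<>();
            for (String row : delta) {
                if (row.startsWith("A "))
                    undo.add("R " + row.substring(2));
                else if (row.startsWith("R "))
                    undo.add("A " + row.substring(2));
            }
            Collections.reverse(undo);
            return undo;
        }

        public static void main(String[] args) {
            // Rob's example delta from earlier in the thread.
            List<String> delta = new ArrayList<>();
            delta.add("A <http://s> <http://p> <http://o>");
            delta.add("R <http://s> <http://p> <http://o>");
            delta.add("A <http://s> <http://p> <http://o>");
            invert(delta).forEach(System.out::println);
        }
    }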
>> We can add e.g. .ntz and .nqz to the known formats -- modules can add languages, syntaxes, parsers and writers. The JSON-LD module does, so I know it works from outside; all the built-in ones actually register themselves the same way and have no special handling.
>
> Rob:
>
> I would definitely be interested in a binary format for both triples and quads. In fact, if it could be generalized to handle arbitrarily sized RDF tuples, that would be even better. I would like to replace the current text-based solution used for the spill-to-disk functionality.
>
> Andy:
>
> I like what you've done and think it could be very useful. One suggestion: the order of the tuples should be <s> <p> <o> <g> to match the N-Quads format [1].
>
> -Stephen
>
> [1] http://www.w3.org/TR/n-quads/
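Andy's point about modules adding languages corresponds to RIOT's registration API in Jena. A minimal sketch for a hypothetical .ntz syntax - the language label and content type are invented for illustration, and a real module would also register a parser and writer:

    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.LangBuilder;
    import org.apache.jena.riot.RDFLanguages;

    // Sketch: registering a hypothetical ".ntz" (binary N-Triples) syntax
    // with RIOT so that it becomes one of the "known formats".
    public class RegisterNtz {
        public static void main(String[] args) {
            Lang ntz = LangBuilder.create("NTZ", "application/x-ntriples-binary")
                                  .addFileExtensions("ntz")
                                  .build();
            RDFLanguages.register(ntz);
            // A parser and writer would be hooked up via RDFParserRegistry
            // and RDFWriterRegistry; omitted here.
            System.out.println("Registered: " + RDFLanguages.fileExtToLang("ntz"));
        }
    }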
