For what I do in Hadoop, I don't care about the sort order so long as, in some controlled domain, nodes always sort in the same order. This is sufficient to group triples that have the same ?s, ?p, or ?o together, which is good for grouping on relationships, joining, etc. Something stupid but fast would be good for that.
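For example, something like the following would do - this is only a rough sketch, and the class name is made up rather than anything that exists in the Hadoop-RDF module:

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.WritableComparator;

    // Orders keys by their raw serialised bytes.  The order is arbitrary but
    // deterministic, so triples sharing the same ?s, ?p or ?o key land next
    // to each other in the reducer without the term ever being deserialised.
    public class RawNodeComparator extends WritableComparator {
        public RawNodeComparator() {
            super(BytesWritable.class);
        }

        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            return compareBytes(b1, s1, l1, b2, s2, l2);
        }
    }

You'd plug that in with job.setSortComparatorClass() (or as the grouping comparator) and never pay for building Node objects during the shuffle.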
The next step up is the SPARQL ordering, which is a bit iffy:

http://www.w3.org/TR/sparql11-query/#modOrderBy

People who are picky about the answers they get will need to define their own sort order, either by putting in a custom sort order (which usually won't be fast) or using a static data type (e.g. WritableInteger) for the key.

---

I'd like to have some way of processing triples in Hadoop which avoids UTF8 -> String conversion if at all possible. Often a map job filters out triples or tuples with a selectivity of 1% or so, so in a case like that you don't want to do any more work than you have to - say, test the predicate and nothing else.

---

As for representing URIs, a very efficient way is to compute the cumulative probability distribution of the URIs, which, surprisingly, can be computed in parallel for real-world cases:

https://github.com/paulhoule/infovore/wiki/Design-of-a-data-processing-path

You can then code these with a variable-length code which can then be treated as an opaque identifier. This is insanely fast if you want to do PageRank-style graph calculations, but it does mean joining if you want to ask questions about the string representation of the URI.
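To give the flavour of it (this is a toy, not the actual Infovore code, and it only does the frequency-ranking half of the story rather than the full cumulative-distribution construction described on that wiki page):

    import java.io.ByteArrayOutputStream;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.Map;

    // Toy frequency-based coder: the most frequent URIs get the smallest ids,
    // and small ids get the shortest varint encodings, so the "hot" part of
    // the graph is cheap to store, shuffle and compare as opaque bytes.
    public class UriCoder {
        private final Map<String, Integer> ids = new HashMap<>();

        // Assign ids in descending order of observed frequency.
        public UriCoder(Map<String, Long> counts) {
            counts.entrySet().stream()
                  .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                  .forEach(e -> ids.put(e.getKey(), ids.size()));
        }

        // Standard base-128 varint: 7 bits per byte, high bit means "more".
        public byte[] encode(String uri) {
            int id = ids.get(uri);   // assumes the URI was seen in the counts
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            while ((id & ~0x7F) != 0) {
                out.write((id & 0x7F) | 0x80);
                id >>>= 7;
            }
            out.write(id);
            return out.toByteArray();
        }
    }

PageRank-style jobs then only ever touch those byte identifiers; you join back against a node table when you finally need the URI strings again.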
On Fri, Jun 20, 2014 at 6:18 AM, Andy Seaborne <[email protected]> wrote:

> On 20/06/14 09:48, Rob Vesse wrote:
>>
>> Andy
>>
>> Comments inline:
>
> Ditto.
>
>> On 19/06/2014 17:06, "Andy Seaborne" <[email protected]> wrote:
>>
>>> Lizard needs to do network transfer of RDF data. Rather than just
>>> doing something specific to Lizard, I've started on a general binary
>>> RDF module using Apache Thrift.
>>>
>>> == RDF-Thrift
>>> Work in Progress :: https://github.com/afs/rdf-thrift/
>>>
>>> Discussion welcome.
>>>
>>> The current plan is to have three supported abstractions:
>>>
>>> 1. StreamRDF
>>> 2. SPARQL Result Sets
>>> 3. RDF patch (which is very like StreamRDF but with A and D markers).
>>>
>>> A first pass for StreamRDF is done, including some attempts to reduce
>>> object churn when crossing the abstraction boundaries. Abstraction is
>>> all very well, but repeated conversion of data structures can slow
>>> things down.
>>>
>>> Using StreamRDF means that prefix compression can be done.
>>>
>>> See
>>> https://github.com/afs/rdf-thrift/blob/master/RDF.thrift
>>> for the encoding at the moment for just RDF.
>>
>> Looks like a sane encoding from what I understand of Thrift
>
> Thanks - it's my first real use of Thrift. There are choices and I hope
> to do a similar-but-different design. This one flattens everything into
> a tagged RDF_Term - that skips a layer of objects that a union of
> RDF_IRI, RDF_BNODE, RDF_Literal, ... has. Little on-the-wire
> difference, less Java object churn, maybe over-engineering :-)
>
>>> == In Jena
>>>
>>> There are a number of places this might be useful:
>>>
>>> 1/ Fuseki and "application/sparql-results+thrift", "application/x-thrift"
>>>
>>> (oh dear, "application/x-thrift" - "x-" is not encouraged any more due
>>> to the transition problem, c.f. "application/x-www-form-urlencoded")
>>>
>>> 2/ Hadoop-RDF
>>>
>>> This is currently using N-Triples/N-Quads. Rob - presumably this would
>>> be useful eventually. AbstractNodeTupleWritable /
>>> AbstractNLineFileInputFormat look about right to me, but that's from
>>> code-reading not code-doing.
>>
>> Yes and No
>>
>> The concerns on Hadoop are somewhat different. It is
>> advantageous/required that the Hadoop code has direct control over the
>> binary serialisation because of the contract for Writable. This is
>> needed both to support serialisation and deserialisation of values and
>> to optionally provide direct comparisons on the binary representation
>> of terms, which has substantial performance benefits because it avoids
>> unnecessarily deserialising terms.
>>
>> It is unclear to me whether using RDF Thrift would allow this or not?
>> Or if the overhead of Thrift would be more overall?
>
> The RDF Thrift format is binary comparable if the same TProtocol is
> used. TProtocol is Thrift-ism for the choice of wire layout - Binary,
> Compact, JSON or Tuples (more compact with less resilience) - and both
> ends have to agree on the TProtocol for interworking. Normally, one
> would just use "compact".
>
> As Thrift is used in a Hadoop setting, there should be places to go and
> learn from other people's practical experience.
>
>> Certainly it would be possible to support an RDF Thrift based binary
>> RDF as an input & output format regardless of how the writables are
>> defined
>>
>>> (I know you/Cray have some internal binary RDF)
>>
>> Yes, though the intent of that format is somewhat different. It was
>> designed to be a parallel-friendly, RDF-specific compression format, so
>> besides a global header at the start of the stream it is block
>> oriented, such that each block is entirely independent of the others
>> and requires only the data in the global header and itself in order to
>> permit decompression.
>>
>> For small data there will be little/no benefit; for large data the
>> compression achieved is roughly equivalent to gzipped N-Triples, with
>> the primary advantage that it is substantially faster to produce (about
>> 5x) and potentially even faster given a good parallel implementation.
>> Of course what we have is mostly just a prototype and it hasn't been
>> heavily optimised, so there may be more performance to be had.
>
> Thanks for the description. Uses of binary RDF include several that are
> write-once-read-once. Compression other than applying prefixes is not
> the target here (it's orthogonal?). "snappy" would be the obvious
> choice to look at for a single stream because of the compression time
> costs of gzip.
>
>>> 3/ Data bags and spill to disk
>>>
>>> 4/ RDF patch
>>>
>>> 5/ TDB (v2 - it would be a disk change) could usefully use the RDF
>>> term encoding for the node table.
>>
>> Would this actually save much space?
>>
>> It looks like you'd only save a few bytes because you still have to
>> store the bulk of the term encoding; you just lose some of the surface
>> syntax that something like an N-Triples encoding would give you
>
> For TDB the big win is speed, not space. At the moment, the on-disk
> node format is a string that needs parsing and producing by string
> bashing.
>
> Both are relatively expensive, and the thing that limits load
> performance for medium-sized datasets is the node table. The node cache
> largely hides the cost during SPARQL execution.
>
> In Lizard, storing Thrift means that remote retrieval is simply
> disk-bytes to network-bytes - no decode-encode in the node table
> storage server.
>
>     Andy
>
>> Rob
>>
>>> 6/ Files. Add to RIOT as a new syntax (a fairly direct access to
>>> StreamRDF+Thrift) which then helps TDB loading.
>>>
>>> 7/ Caching result sets in queries in Fuseki.
>>>
>>> In an ideal world, the Thrift format could be shared across toolkits.
>>> There is nothing Jena specific about the wire encoding.
>
> ...
>
>>>     Andy
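For what it's worth, the "binary comparable if the same TProtocol is used" point looks roughly like this from the Java side. This is only a sketch: RDF_Term stands in for whatever class Thrift generates from RDF.thrift, and the codec class itself is made up:

    import org.apache.thrift.TException;
    import org.apache.thrift.TSerializer;
    import org.apache.thrift.TDeserializer;
    import org.apache.thrift.protocol.TCompactProtocol;

    // Encode/decode an RDF term with the compact protocol.  Because both
    // directions are pinned to the same TProtocol, the byte[] produced here
    // is the same byte[] the other end writes, so it can be compared raw.
    public class CompactTermCodec {

        public byte[] encode(RDF_Term term) throws TException {
            return new TSerializer(new TCompactProtocol.Factory()).serialize(term);
        }

        public RDF_Term decode(byte[] bytes) throws TException {
            RDF_Term term = new RDF_Term();
            new TDeserializer(new TCompactProtocol.Factory()).deserialize(term, bytes);
            return term;
        }
    }

Pin both ends to the compact protocol like that and the stored or shipped bytes can be compared directly, e.g. from a Hadoop RawComparator, without decoding.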
--
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254    paul.houle on Skype    [email protected]
