On 20/06/14 09:48, Rob Vesse wrote:
Andy

Comments inline:

Ditto.


On 19/06/2014 17:06, "Andy Seaborne" <[email protected]> wrote:

Lizard needs to do network transfer of RDF data.  Rather than just doing
something specific to Lizard, I've started on a general binary RDF
module using Apache Thrift.

== RDF-Thrift
Work in Progress :: https://github.com/afs/rdf-thrift/

Discussion welcome.


The current plan is to have three supported abstractions:

1. StreamRDF
2. SPARQL Result Sets
3. RDF patch (which is very like StreamRDF but with A and D markers).
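
For (3), a patch stream is essentially StreamRDF rows carrying an
add/delete marker; illustratively (the exact syntax is not settled):

    A <http://example/s> <http://example/p> "new value" .
    D <http://example/s> <http://example/p> "old value" .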

A first pass for StreamRDF is done, including some attempts to reduce
object churn when crossing the abstraction boundaries. Abstraction is all
very well, but repeated conversion of data structures can slow things down.

Using StreamRDF means that prefix compression can be done.
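
For example, a StreamRDF sink sees prefix events before the triples that
use them, so an encoder can shorten IRIs on the wire. A minimal sketch
against Jena's StreamRDF interface (encodeTerm is hypothetical, just to
show where the hook is):

    import org.apache.jena.riot.system.StreamRDF;
    import com.hp.hpl.jena.graph.Triple;
    import com.hp.hpl.jena.sparql.core.Quad;
    import java.util.HashMap;
    import java.util.Map;

    // Sketch: a StreamRDF sink that remembers prefixes so that an encoder
    // could write prefixed names instead of full IRIs.
    public class PrefixingSink implements StreamRDF {
        private final Map<String, String> prefixes = new HashMap<>();

        @Override public void start() { }
        @Override public void base(String base) { }

        @Override public void prefix(String prefix, String iri) {
            // Remember the mapping; a real encoder would also emit a
            // prefix row so the receiving end can expand again.
            prefixes.put(iri, prefix);
        }

        @Override public void triple(Triple triple) {
            // encodeTerm(...) is hypothetical: it would consult 'prefixes'
            // and write (prefix, localName) instead of the full IRI.
        }

        @Override public void quad(Quad quad) { }
        @Override public void finish() { }
    }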

See
   https://github.com/afs/rdf-thrift/blob/master/RDF.thrift
for the current encoding (just RDF, for the moment).

Looks like a sane encoding from what I understand of Thrift

Thanks - it's my first real use of Thrift. There are choices to be made, and I hope to try a similar-but-different design. This one flattens everything into a tagged RDF_Term, which skips the layer of objects that a union of RDF_IRI, RDF_BNODE, RDF_Literal, ... would have. Little on-the-wire difference, less Java object churn, maybe over-engineering :-)
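
Very roughly, and in Java terms rather than Thrift IDL (illustrative
classes only, not what the Thrift compiler actually generates):

    // Purely illustrative - not the generated code.
    // Nested union: two Java objects per term on the receiving side.
    class RDF_IRI        { String iri; }
    class RDF_Term_Union { RDF_IRI iri; /* or RDF_BNode, RDF_Literal, ... */ }

    // Flattened, tagged term: one object, with a tag saying which fields apply.
    class RDF_Term_Flat {
        byte   kind;                   // IRI, BNODE, LITERAL, ...
        String iri;                    // when kind == IRI
        String label;                  // when kind == BNODE
        String lex, lang, datatype;    // when kind == LITERAL
    }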

== In Jena

There are a number of places this might be useful:

1/ Fuseki and "application/sparql-results+thrift", "application/x-thrift"

(oh dear, "application/x-thrift" - the "x-" prefix is not encouraged any
more due to the transition problem, cf. "application/x-www-form-urlencoded")

2/ Hadoop-RDF

This is currently using N-Triples/N-Quads.  Rob - presumably this would
be useful eventually.  AbstractNodeTupleWritable /
AbstractNLineFileInputFormat look about right to me, but that's from
code-reading, not code-doing.

Yes and No

The concerns on Hadoop are somewhat different.  It is
advantageous/required that the Hadoop code has direct control over the
binary serialisation because of the contract for Writable.  This is needed
both to support serialisation and deserialisation of values and, optionally,
to provide direct comparisons on the binary representation of terms, which
has substantial performance benefits because it avoids deserialising terms
unnecessarily.
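
For concreteness, the contract looks roughly like this (a minimal sketch
with a hypothetical RdfTermWritable, not our actual Hadoop-RDF classes):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Sketch: the Writable controls its own byte layout, and a raw
    // comparator can order records without ever deserialising the terms.
    public class RdfTermWritable implements WritableComparable<RdfTermWritable> {
        private byte[] encoded = new byte[0];   // the term's binary encoding

        @Override public void write(DataOutput out) throws IOException {
            out.writeInt(encoded.length);
            out.write(encoded);
        }

        @Override public void readFields(DataInput in) throws IOException {
            encoded = new byte[in.readInt()];
            in.readFully(encoded);
        }

        @Override public int compareTo(RdfTermWritable other) {
            return WritableComparator.compareBytes(encoded, 0, encoded.length,
                                                   other.encoded, 0, other.encoded.length);
        }

        // The raw comparator: skips the 4-byte length prefix and compares
        // the serialised bytes directly - this is where the win is.
        public static class Comparator extends WritableComparator {
            public Comparator() { super(RdfTermWritable.class); }
            @Override
            public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
                return compareBytes(b1, s1 + 4, l1 - 4, b2, s2 + 4, l2 - 4);
            }
        }
    }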

It is unclear to me whether using RDF Thrift would allow this or not, or
whether the overhead of Thrift would be greater overall.

The RDF Thrift format is binary-comparable if the same TProtocol is used. TProtocol is the Thrift term for the choice of wire layout - Binary, Compact, JSON or Tuple (more compact, less resilient) - and both ends have to agree on the TProtocol for interworking. Normally, one would just use "compact".
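
In Java it is just a choice of protocol over a transport (a sketch;
RDF_Term stands for whatever class the Thrift compiler generates):

    import java.io.OutputStream;
    import org.apache.thrift.TException;
    import org.apache.thrift.protocol.TCompactProtocol;
    import org.apache.thrift.protocol.TProtocol;
    import org.apache.thrift.transport.TIOStreamTransport;

    public class ThriftWriteSketch {
        // Sketch: both ends must pick the same TProtocol; "compact" here.
        public static void writeTerm(OutputStream out, RDF_Term term) throws TException {
            TProtocol protocol = new TCompactProtocol(new TIOStreamTransport(out));
            term.write(protocol);   // generated structs write themselves to a TProtocol
        }
    }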

As Thrift is used in a Hadoop setting, there should be places to go and learn from other people's practical experience.

Certainly it would be possible to support an RDF Thrift based binary RDF as
an input & output format regardless of how the Writables are defined.


(I know you/Cray have some internal binary RDF)

Yes, though the intent of that format is somewhat different.  It was
designed to be a parallel-friendly, RDF-specific compression format: besides
a global header at the start of the stream, it is block oriented, with each
block entirely independent of the others and requiring only the data in the
global header plus the block itself in order to permit decompression.

For small data there will be little/no benefit; for large data the
compression achieved is roughly equivalent to gzipped N-Triples, with the
primary advantage that it is substantially faster to produce (about 5x) and
potentially faster still given a good parallel implementation.  Of course,
what we have is mostly just a prototype and it hasn't been heavily optimised,
so there may be more performance to be had.

Thanks for the description. Several of the intended uses of binary RDF are write-once-read-once. Compression, other than applying prefixes, is not the target here (it's orthogonal?). Snappy would be the obvious choice to look at for a single stream because of the compression-time cost of gzip.
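
For a single stream that is just wrapping the OutputStream, e.g. with the
xerial snappy-java library (a sketch, assuming that library):

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import org.xerial.snappy.SnappyOutputStream;

    public class SnappySketch {
        // Sketch: Snappy as a cheap, stream-level compression layer;
        // the RDF Thrift writer just sees an OutputStream.
        public static OutputStream open(String filename) throws IOException {
            return new SnappyOutputStream(new FileOutputStream(filename));
        }
    }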



3/ Data bags and spill to disk

4/ RDF patch

5/ TDB (v2 - it would be an on-disk format change) could usefully use the
RDF term encoding for the node table.

Would this actually save much space?

It looks like you'd only save a few bytes because you still have to store
the bulk of the term encoding; you just lose some of the surface syntax
that something like an N-Triples encoding would give you.

For TDB the big win is speed, not space. At the moment, the on-disk node
format is a string that has to be parsed on read and produced by string
bashing on write.

Both are relatively expensive, and the thing that limits load performance
for medium-sized datasets is the node table. The node cache largely hides
the cost during SPARQL execution.

In Lizard, storing Thrift means that remote retrieval is simply
disk-bytes to network-bytes - no decode-encode in the node table storage
server.
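
Something like this, as a sketch (class and method names hypothetical):

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.ByteBuffer;

    // Sketch: the node-table server answers a lookup by copying the stored
    // Thrift bytes straight to the network - no parse, no re-encode.
    public class NodeTableServerSketch {
        public void sendNode(ByteBuffer storedThriftBytes, OutputStream network) throws IOException {
            byte[] bytes = new byte[storedThriftBytes.remaining()];
            storedThriftBytes.get(bytes);
            network.write(bytes);
        }
    }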

        Andy


Rob


6/ Files.  Add to RIOT as a new syntax (fairly direct access to
StreamRDF+Thrift), which then helps TDB loading.

7/ Caching result sets in queries in Fuseki.

In an ideal world, the Thrift format could be shared across toolkits.
There is nothing Jena specific about the wire encoding.
...

        Andy




