Re: Discussion GSoC 2015 Project (MARMOTTA-593)

Peter Ansell Wed, 25 Mar 2015 15:59:24 -0700

Hi Junyue,

Thanks for your interest in the project. See my comments inline below.

On 20 March 2015 at 03:38, Junyue Wang <[email protected]> wrote:
> Hello all,
>
> As a master's student majoring in the Semantic Web, I'm very interested in
> the GSoC 2015 project MARMOTTA-593 [1]. I've studied the code of Sesame RIO
> and RDF HDT, and I know how to implement the Sesame RIO infrastructure from
> scratch. As for RDF HDT, here are some basic ideas for the implementation
> in this project, on which your comments are very welcome:
>
> 1) RDFParser for HDT
> As is shown in [2], the HDT RDFParser can search all the triples in the
> HDT, and then transform each TripleString into a Statement, something like:
> IteratorTripleString it = hdt.search("", "", "");
> while (it.hasNext()) {
>         TripleString ts = it.next();
>         ... // transform ts into a Statement
>         ... // pass the Statement to the RDFHandler
> }

That looks good to me.
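
For the "transform ts into a Statement" step, one detail to plan for is
deciding whether each term string becomes an IRI, a blank node, or a
literal. Below is a minimal Java sketch of that decision using only the
standard library. It assumes the term strings follow an N-Triples-like
convention (blank nodes start with "_:", literals start with a double
quote, anything else is an IRI), so please check that assumption against
the HDT library you use. HdtTermSketch, Kind, classify and lexicalForm
are hypothetical names, not part of hdt-java or Sesame.

```java
// Hypothetical helper, not part of hdt-java or Sesame.
final class HdtTermSketch {

    enum Kind { IRI, BLANK_NODE, LITERAL }

    // Assumption: terms use an N-Triples-like syntax, with IRIs stored
    // without surrounding angle brackets.
    static Kind classify(String term) {
        if (term == null || term.isEmpty()) {
            throw new IllegalArgumentException("empty term");
        }
        if (term.startsWith("_:")) {
            return Kind.BLANK_NODE;
        }
        if (term.charAt(0) == '"') {
            return Kind.LITERAL;
        }
        return Kind.IRI;
    }

    // Returns the text between the opening quote and the last quote.
    // Escape sequences inside the literal are NOT decoded here.
    static String lexicalForm(String literal) {
        int end = literal.lastIndexOf('"');
        if (classify(literal) != Kind.LITERAL || end < 1) {
            throw new IllegalArgumentException("not a literal: " + literal);
        }
        return literal.substring(1, end);
    }

    public static void main(String[] args) {
        System.out.println(classify("http://example.org/s"));
        System.out.println(classify("_:b1"));
        System.out.println(lexicalForm("\"hello\"@en"));
    }
}
```

In the real parser, each Kind would map to the matching ValueFactory
call before the Statement is handed to the RDFHandler.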

> In addition, the HDT RDFParser should be registered with Rio beforehand
> for a new RDFFormat, so that the following works:
> Rio.createParser(RDFFormat.HDT); // for .hdt files

Sesame is set up so that you can add your own formats without having to
get a constant added to RDFFormat. Of course, in the long term we will
get a constant added for HDT to RDFFormat, but in the short term, you
can create your own definition of it locally.

Registering the parser is done using META-INF/services/ files that
link to RDFParserFactory and RDFWriterFactory classes. See the
following examples for RDF/XML:

https://bitbucket.org/openrdf/sesame/src/db49126a8cf12c420df57d65deb843707c166651/core/rio/rdfxml/src/main/resources/META-INF/services/?at=master

Once you create the META-INF/services files,
Rio.createParser(HDTFormat.HDT) should work (as long as you used that
constant as the key for the RDFParserFactory, etc.).
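
To illustrate why the key matters, here is a simplified Java sketch of a
format-keyed factory registry. It is only a stand-in: Format,
ParserFactory, Registry and HdtParserFactory are hypothetical types
written for this example, not Sesame classes, and the stand-in compares
formats by identity, which may be stricter than the real library.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Simplified stand-in for a format-keyed factory registry.
// None of these types are Sesame classes; they only mirror the idea.
final class FormatRegistrySketch {

    // Stand-in for a format definition (no equals/hashCode override,
    // so two instances with the same name are different keys).
    static final class Format {
        final String name;
        final String fileExtension;

        Format(String name, String fileExtension) {
            this.name = name;
            this.fileExtension = fileExtension;
        }
    }

    // Stand-in for a parser factory that reports which format it serves.
    interface ParserFactory {
        Format getFormat();
        String describeParser();
    }

    // Locally defined constant, playing the role of HDTFormat.HDT.
    static final Format HDT = new Format("HDT", "hdt");

    static final class HdtParserFactory implements ParserFactory {
        @Override public Format getFormat() { return HDT; }
        @Override public String describeParser() {
            return "parser for ." + HDT.fileExtension + " files";
        }
    }

    static final class Registry {
        private final Map<Format, ParserFactory> factories = new HashMap<>();

        // The factory is stored under the format it reports.
        void register(ParserFactory factory) {
            factories.put(factory.getFormat(), factory);
        }

        Optional<ParserFactory> lookup(Format format) {
            return Optional.ofNullable(factories.get(format));
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        registry.register(new HdtParserFactory());
        // Lookup with the same constant the factory reported.
        System.out.println(registry.lookup(HDT).isPresent());
        // Lookup with a separately created format object.
        System.out.println(registry.lookup(new Format("HDT", "hdt")).isPresent());
    }
}
```

The point of the sketch is that the lookup only succeeds when the caller
and the factory agree on the format key, which is why sharing a single
constant is the simplest approach.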

> 2) RDFWriter for HDT
> As is illustrated in [3] (hdt.HDT#saveToHDT), there are 4 steps to write
> an HDT file: GLOBAL, HEADER, DICTIONARY, and finally TRIPLES. So we have
> the first 3 steps in HDT RDFWriter.startRDF(), with the last one in
> HDT RDFWriter.handleStatement() (borrowing code from
> TriplesPrivate.save()). Nothing should be done in endRDF().
>
> 3) RDFHandler for HDT (not required)
> No other RDFHandler is required for HDT. Note that the RDFWriter in 2) is
> itself an RDFHandler. Any other RDFHandler is out of the scope of this
> GSoC project. Right?

Yes, you are correct. Once you have an RDFWriter and RDFParser, the
input/output section will be complete.

> 4) Query support for HDT (not required)
> Sesame RIO does not involve a querying component (e.g. SPARQL). Therefore,
> this GSoC project will not address the Sesame query part for HDT. Am I correct?

Query support would be done by implementing the Sail interface, which
can then be queried using SPARQL by placing the SailRepository wrapper
on top of it.

One example of a custom extended Sail that you may use as a reference
is an interface for the BED format that Jerven Bolleman created,
although if it doesn't exactly fit your case, feel free to ask for
other advice:

https://github.com/JervenBolleman/sparql-bed/tree/master/sparql-bed/src/main/java/ch/isbsib/sparql/bed

> Last question: this project seems to be related only to Sesame and RDF
> HDT. How does it benefit Marmotta?

Marmotta benefits by supporting the HDT format for both input and
output. The RDF community generally picks a concrete format based on
which is the best candidate for a particular task. HDT may be more
suitable than N-Quads for bulk data in some tasks, but N-Quads can be
processed in a streaming fashion and compresses relatively well
with streaming compression if necessary. By comparison, hand-edited
RDF files are generally written in Turtle these days, although there are
still quite a few hand-edited RDF/XML files, possibly because there
are many examples available for that format.

Thanks,

Peter
