Re: Discussion GSoC 2015 Project (MARMOTTA-593)

Sergio Fernández Fri, 27 Mar 2015 00:59:52 -0700

Hi Junyue,

the proposal looks good, I hope it could be a successful GSoC project.

Just one comment about the background: the Marmotta exchangeable backends are not the main motivation. Of course having native HDT support in Sesame would allow you to build specialized backends for specific purposes where HDT fits quite well. But that's just a side effect I'd say. The main goal is the HDT support itself.


Please, if you have time, update your proposal to address this issue.

Thanks.

Cheers,


On 26/03/15 17:29, Junyue Wang wrote:

Hi Sergio and Peter:

http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/junyuew/5629499534213120

Please have a look at the above proposal. I composed it from our short
discussions. Thank you!
Time is urgent. Your quick comments are appreciated.

yours,
junyue


On Thu, Mar 26, 2015 at 6:59 AM, Peter Ansell <[email protected]>
wrote:

Just one note, don't copy any code out of SPARQL-BED, as it is AGPL
licensed, which is not compatible with the Apache license. Not sure what the
guidelines are for using it as a reference though. If you want to be
on the safe side with Apache licensing, you could just look at the
internal Sesame Sail implementations:


https://bitbucket.org/openrdf/sesame/src/db49126a8cf12c420df57d65deb843707c166651/core/sail/?at=master

Cheers,

Peter

On 26 March 2015 at 09:56, Peter Ansell <[email protected]> wrote:

Hi Junyue,

Thanks for your interest in the project. See my comments inline below.

On 20 March 2015 at 03:38, Junyue Wang <[email protected]> wrote:

Hello all,

As a master's student majoring in the Semantic Web, I'm very interested in the
GSoC 2015 project MARMOTTA-593 [1]. I've made some code studies of Sesame RIO
and RDF HDT. I know how to implement the Sesame RIO infrastructure from
scratch. As to RDF HDT, here are some basic ideas for the implementation in
this project, for which your comments are very welcome:


1) RDFParser for HDT
As is shown in [2], the HDT RDFParser can search all the triples in the
HDT, and then transform each TripleString into a Statement, something like:

IteratorTripleString it = hdt.search("", "", "");
while (it.hasNext()) {
    TripleString ts = it.next();
    ... // transform ts into a Statement
    ... // pass the Statement to the RDFHandler
}


That looks good to me.
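
The "transform ts into a Statement" step mostly comes down to deciding what kind of RDF term each string is. Below is a minimal sketch of that decision, assuming the HDT strings are N-Triples style (literals start with a double quote, blank nodes with "_:", everything else is a bare IRI). Please check that assumption against the HDT library. HdtTermSketch, Term and parseTerm are hypothetical stand-ins written with the JDK only, not Sesame or HDT classes.

```java
/** Stand-in sketch: splits an HDT term string into the parts a Sesame Value needs. */
final class HdtTermSketch {

    enum Kind { IRI, BNODE, LITERAL }

    /** Hypothetical holder; real code would create a Sesame Value instead. */
    static final class Term {
        final Kind kind;
        final String value;     // IRI, blank node id, or literal lexical form
        final String language;  // non-null only for language-tagged literals
        final String datatype;  // non-null only for typed literals

        Term(Kind kind, String value, String language, String datatype) {
            this.kind = kind;
            this.value = value;
            this.language = language;
            this.datatype = datatype;
        }
    }

    /** Assumes N-Triples-style terms: "lex"@lang, "lex"^^<dt>, _:id, or a bare IRI. */
    static Term parseTerm(CharSequence raw) {
        String s = raw.toString();
        if (s.startsWith("_:")) {
            return new Term(Kind.BNODE, s.substring(2), null, null);
        }
        if (!s.startsWith("\"")) {
            return new Term(Kind.IRI, s, null, null);
        }
        // Neither a language tag nor a datatype IRI can contain a quote,
        // so the last quote in the string closes the lexical form.
        int close = s.lastIndexOf('"');
        if (close == 0) {
            throw new IllegalArgumentException("Unterminated literal: " + s);
        }
        String lexical = s.substring(1, close); // escape sequences are left untouched
        String suffix = s.substring(close + 1);
        if (suffix.startsWith("@")) {
            return new Term(Kind.LITERAL, lexical, suffix.substring(1), null);
        }
        if (suffix.startsWith("^^<") && suffix.endsWith(">")) {
            String datatype = suffix.substring(3, suffix.length() - 1);
            return new Term(Kind.LITERAL, lexical, null, datatype);
        }
        return new Term(Kind.LITERAL, lexical, null, null);
    }

    public static void main(String[] args) {
        Term t = parseTerm("\"42\"^^<http://www.w3.org/2001/XMLSchema#int>");
        System.out.println(t.kind + " " + t.value + " " + t.datatype);
    }
}
```

In the real parser each Term would presumably map onto a Sesame ValueFactory call (createURI, createBNode, or one of the createLiteral variants), with the three resulting values combined through createStatement before being handed to the RDFHandler.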

In addition, the HDT RDFParser should be registered with Rio beforehand
for a new RDFFormat, so that the following works:
Rio.createParser(RDFFormat.HDT); // for .hdt files


Sesame is set up so that you can add your own formats without having to
get a constant added to RDFFormat. Of course, in the long term we will
get a constant added for HDT to RDFFormat, but in the short term, you
can create your own definition of it locally.

Registering the parser is done using META-INF/services/ files that
link to RDFParserFactory and RDFWriterFactory classes. See the
following examples for RDF/XML:

https://bitbucket.org/openrdf/sesame/src/db49126a8cf12c420df57d65deb843707c166651/core/rio/rdfxml/src/main/resources/META-INF/services/?at=master


Once you create the META-INF/services files,
Rio.createParser(HDTFormat.HDT) should work (as long as you used that
constant as the key for the RDFParserFactory, etc.).
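
For illustration, the two registration files could look roughly like this. Each file is named after the fully qualified Sesame factory interface and contains one line with the implementing class. The package and class names inside the files (org.example.rio.hdt.*) are hypothetical placeholders, not names that exist anywhere yet.

```text
# File: src/main/resources/META-INF/services/org.openrdf.rio.RDFParserFactory
org.example.rio.hdt.HDTParserFactory

# File: src/main/resources/META-INF/services/org.openrdf.rio.RDFWriterFactory
org.example.rio.hdt.HDTWriterFactory
```

These are ordinary Java service provider files, so they only take effect when the jar that contains them is on the classpath.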

2) RDFWriter for HDT
As is illustrated in [3] (hdt.HDT#saveToHDT), there are 4 steps to write
into HDT: GLOBAL, HEADER, DICTIONARY, and finally TRIPLES. So we have the
first 3 steps in HDT RDFWriter.startRDF(), with the last one in
HDT RDFWriter.handleStatement() (borrowing code from TriplesPrivate.save()).
Nothing should be done in endRDF().

3) RDFHandler for HDT (not required)
No other RDFHandler is required for HDT. Note that an RDFWriter is itself an
RDFHandler, which is covered by 2). Any other RDFHandler is out of the scope
of this GSoC project. Right?


Yes, you are correct, once you have an RDFWriter and RDFParser the
input/output section will be complete.

4) Query support for HDT (not required)
Sesame RIO does not involve a querying component (e.g. SPARQL). Therefore,
this GSoC project will not address the Sesame query part for HDT. Am I correct?


Query support would be done by implementing the Sail interface, which
can then be queried using SPARQL by placing the SailRepository wrapper
on top of it.

One example of a custom extended Sail that you may use as a reference
is an interface for the BED format that Jerven Bolleman created,
although if it doesn't exactly fit your case, feel free to ask for
other advice:

https://github.com/JervenBolleman/sparql-bed/tree/master/sparql-bed/src/main/java/ch/isbsib/sparql/bed

Last question: this project seems to relate only to Sesame and RDF HDT; how
does it benefit Marmotta?


Marmotta benefits from now supporting the HDT format for both input
and output. The RDF community generally picks concrete formats based
on the best candidate for a particular task, so HDT may be more
suitable than N-Quads for bulk data for some tasks, but N-Quads can be
processed in a streaming fashion and can compress relatively well
using streaming compression if necessary. Comparatively, hand-edited
RDF files are generally written in Turtle these days, although there are
still quite a few hand-edited RDF/XML files, possibly because there
are many examples available for that format.

Thanks,

Peter


--
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 660 2747 925
e: [email protected]
w: http://redlink.co
