Hi Junyue,
The proposal looks good; I hope it becomes a successful GSoC project.
Just one comment about the background: the Marmotta exchangeable
backends are not the main motivation. Of course having native HDT
support in Sesame would allow you to build specialized backends for
specific purposes where HDT fits quite well. But that's just a side
effect I'd say. The main goal is the HDT support itself.
Please, if you have time, update your proposal to address this point.
Thanks.
Cheers,
On 26/03/15 17:29, Junyue Wang wrote:
Hi Sergio and Peter:
http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/junyuew/5629499534213120
Please have a look at the above proposal. I composed it from our short
discussions. Thank you!
Time is short, so quick comments would be appreciated.
yours,
junyue
On Thu, Mar 26, 2015 at 6:59 AM, Peter Ansell <[email protected]>
wrote:
Just one note: don't copy any code out of SPARQL-BED, as it is AGPL
licensed, which is not compatible with the Apache license. Not sure what the
guidelines are for using it as a reference though. If you want to be
on the safe side with Apache licensing, you could just look at the
internal Sesame Sail implementations:
https://bitbucket.org/openrdf/sesame/src/db49126a8cf12c420df57d65deb843707c166651/core/sail/?at=master
Cheers,
Peter
On 26 March 2015 at 09:56, Peter Ansell <[email protected]> wrote:
Hi Junyue,
Thanks for your interest in the project. See my comments inline below.
On 20 March 2015 at 03:38, Junyue Wang <[email protected]> wrote:
Hello all,
As a master's student majoring in the Semantic Web, I'm very interested in
the GSoC 2015 project MARMOTTA-593 [1]. I have studied the code of Sesame
Rio and RDF HDT, and I know how to implement the Sesame Rio infrastructure
from scratch. As for RDF HDT, here are some basic ideas for the
implementation in this project, on which your comments are very welcome:
1) RDFParser for HDT
As shown in [2], the HDT RDFParser can search all the triples in the
HDT file and then transform each TripleString into a Statement, something
like:
IteratorTripleString it = hdt.search("", "", "");
while (it.hasNext()) {
    TripleString ts = it.next();
    ... // transform ts into a Statement
    ... // pass the Statement to the RDFHandler
}
That looks good to me.
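The loop above can be sketched with the standard library alone. This is only a rough illustration: RawTriple is a hypothetical stand-in for hdt-java's TripleString, the Consumer stands in for a Sesame RDFHandler, and the prefix rules in kindOf (a leading quote for literals, "_:" for blank nodes) are an assumption about how the terms are serialized, not something confirmed in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

class HdtParseLoopSketch {

    /** Hypothetical stand-in for hdt-java's TripleString: three raw term strings. */
    static final class RawTriple {
        final String subject, predicate, object;
        RawTriple(String s, String p, String o) { subject = s; predicate = p; object = o; }
    }

    enum TermKind { IRI, BLANK_NODE, LITERAL }

    /** Classifies a raw term string; the prefix conventions are an assumption. */
    static TermKind kindOf(String term) {
        if (term.startsWith("\"")) return TermKind.LITERAL;
        if (term.startsWith("_:")) return TermKind.BLANK_NODE;
        return TermKind.IRI;
    }

    /** Mirrors the while-loop from the mail: visit every triple and hand it to a sink. */
    static int parseAll(Iterator<RawTriple> it, Consumer<String> handler) {
        int count = 0;
        while (it.hasNext()) {
            RawTriple ts = it.next();
            // A real parser would build a Statement through a ValueFactory here;
            // this sketch only reports which kind of value each position needs.
            handler.accept(kindOf(ts.subject) + " " + kindOf(ts.predicate) + " " + kindOf(ts.object));
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<RawTriple> triples = Arrays.asList(
            new RawTriple("http://example.org/a", "http://example.org/name", "\"Alice\""),
            new RawTriple("_:b1", "http://example.org/knows", "http://example.org/a"));
        List<String> sink = new ArrayList<>();
        int n = parseAll(triples.iterator(), sink::add);
        System.out.println(n + " statements: " + sink);
    }
}
```

The point of the sketch is that each term has to be classified before a Statement can be created, since a Statement needs typed values rather than plain strings.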
In addition, the HDT RDFParser should be registered with Rio beforehand
for a new RDFFormat, so that:
Rio.createParser(RDFFormat.HDT); // for .hdt files
Sesame is set up so that you can add your own formats without having to
get a constant added to RDFFormat. Of course, in the long term we will
get a constant added for HDT to RDFFormat, but in the short term you
can create your own definition of it locally.
Registering the parser is done using META-INF/services/ files that
link to RDFParserFactory and RDFWriterFactory classes. See the
following examples for RDF/XML:
https://bitbucket.org/openrdf/sesame/src/db49126a8cf12c420df57d65deb843707c166651/core/rio/rdfxml/src/main/resources/META-INF/services/?at=master
Once you create the META-INF/services files,
Rio.createParser(HDTFormat.HDT) should work (as long as you used that
constant as the key for the RDFParserFactory, etc.).
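As a rough illustration of the registration step, a service file for the parser factory could look like the fragment below. The file is named after the factory interface (org.openrdf.rio.RDFParserFactory in Sesame 2.x); the class name inside it is a hypothetical placeholder for the HDT factory, not an existing class.

```text
# File: src/main/resources/META-INF/services/org.openrdf.rio.RDFParserFactory
# One fully qualified implementation class per line (hypothetical name):
org.example.rio.hdt.HDTParserFactory
```

A matching file named after RDFWriterFactory would register the writer side in the same way.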
2) RDFWriter for HDT
As illustrated in [3] (hdt.HDT#saveToHDT), there are four steps to write
an HDT file: GLOBAL, HEADER, DICTIONARY, and finally TRIPLES. So the first
three steps go in the HDT RDFWriter.startRDF(), and the last one in the HDT
RDFWriter.handleStatement() (borrowing code from TriplesPrivate.save()).
Nothing needs to be done in endRDF().
3) RDFHandler for HDT (not required)
No other RDFHandler is required for HDT. Note that an RDFWriter is itself
an RDFHandler, which is covered by 2). Any other RDFHandler is out of scope
for this GSoC project. Right?
Yes, you are correct, once you have an RDFWriter and RDFParser the
input/output section will be complete.
4) Query support for HDT (not required)
Sesame Rio does not include a query component (e.g. SPARQL). Therefore,
this GSoC project will not address the Sesame query side for HDT. Am I
correct?
Query support would be done by implementing the Sail interface, which
can then be queried using SPARQL by placing the SailRepository wrapper
on top of it.
One example of a custom extended Sail that you may use as a reference
is an interface for the BED format that Jerven Bolleman created,
although if it doesn't exactly fit your case, feel free to ask for
other advice:
https://github.com/JervenBolleman/sparql-bed/tree/master/sparql-bed/src/main/java/ch/isbsib/sparql/bed
Last question: this project seems to relate only to Sesame and RDF HDT;
how does it benefit Marmotta?
Marmotta benefits from now supporting the HDT format for both input
and output. The RDF community generally picks concrete formats based
on the best candidate for a particular task, so HDT may be more
suitable than N-Quads for bulk data for some tasks, but N-Quads can be
processed in a streaming fashion and can compress relatively well
using streaming compression if necessary. Comparatively, hand-edited
RDF files are generally done in Turtle these days, although there are
still quite a few RDF/XML hand edited files, possibly because there
are many examples available for that format.
Thanks,
Peter
--
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 660 2747 925
e: [email protected]
w: http://redlink.co