Re: Discussion GSoC 2015 Project (MARMOTTA-593)

Junyue Wang Thu, 26 Mar 2015 09:30:03 -0700

Hi Sergio and Peter:

http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/junyuew/5629499534213120


Please have a look at the proposal above, which I put together from our
short discussions. Time is short, so quick comments would be much
appreciated. Thank you!

yours,
junyue


On Thu, Mar 26, 2015 at 6:59 AM, Peter Ansell <[email protected]>
wrote:

> Just one note: don't copy any code out of SPARQL-BED, as it is
> AGPL-licensed, and the AGPL is not compatible with the Apache license.
> I'm not sure what the guidelines are for using it as a reference, though.
> If you want to be on the safe side with Apache licensing, you could just
> look at the internal Sesame Sail implementations:
>
>
> https://bitbucket.org/openrdf/sesame/src/db49126a8cf12c420df57d65deb843707c166651/core/sail/?at=master
>
> Cheers,
>
> Peter
>
> On 26 March 2015 at 09:56, Peter Ansell <[email protected]> wrote:
> > Hi Junyue,
> >
> > Thanks for your interest in the project. See my comments inline below.
> >
> > On 20 March 2015 at 03:38, Junyue Wang <[email protected]> wrote:
> >> Hello all,
> >>
> >> As a master's student majoring in the Semantic Web, I'm very interested
> >> in the GSoC 2015 project MARMOTTA-593 [1]. I have studied the code of
> >> Sesame Rio and RDF HDT, and I know how to implement the Sesame Rio
> >> infrastructure from scratch. As for RDF HDT, here are some basic ideas
> >> for the implementation in this project, on which your comments are very
> >> welcome:
> >>
> >> 1) RDFParser for HDT
> >> As shown in [2], the HDT RDFParser can search all the triples in the
> >> HDT and then transform each TripleString into a Statement, something
> >> like:
> >> IteratorTripleString it = hdt.search("", "", "");
> >> while (it.hasNext()) {
> >>         TripleString ts = it.next();
> >>         ... // transform ts into a Statement
> >>         ... // pass the Statement to the RDFHandler
> >> }
> >
> > That looks good to me.
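
A minimal sketch of the "transform ts into a Statement" step from the loop
above. It assumes that each HDT term string follows an N-Triples-like
convention (literals start with a double quote, blank nodes with "_:", and
anything else is an IRI) and it ignores escape sequences inside literals. The
Term and Kind types and the HdtTermSketch class are hypothetical stand-ins so
the example runs without the HDT or Sesame libraries; a real parser would
build Sesame values through a ValueFactory instead.

```java
// Hypothetical stand-in types: not part of the HDT or Sesame APIs.
class HdtTermSketch {

    enum Kind { IRI, BNODE, LITERAL }

    /** A parsed RDF term: its kind, lexical value, and optional language or datatype. */
    static final class Term {
        final Kind kind;
        final String value;
        final String language; // null unless the literal carries a language tag
        final String datatype; // null unless the literal carries a datatype IRI

        Term(Kind kind, String value, String language, String datatype) {
            this.kind = kind;
            this.value = value;
            this.language = language;
            this.datatype = datatype;
        }

        @Override
        public String toString() {
            return kind + "(" + value
                    + (language != null ? ", lang=" + language : "")
                    + (datatype != null ? ", datatype=" + datatype : "") + ")";
        }
    }

    /** Classifies one term string (assumed N-Triples-like) and splits literals into parts. */
    static Term parseTerm(CharSequence raw) {
        String s = raw.toString();
        if (s.startsWith("_:")) {
            return new Term(Kind.BNODE, s.substring(2), null, null);
        }
        if (s.startsWith("\"")) {
            // The closing quote is the last quote in the string; the suffix after it
            // is empty, a language tag (@en), or a datatype (^^<iri>).
            int close = s.lastIndexOf('"');
            if (close < 1) {
                throw new IllegalArgumentException("Unterminated literal: " + s);
            }
            String lexical = s.substring(1, close);
            String suffix = s.substring(close + 1);
            if (suffix.startsWith("@")) {
                return new Term(Kind.LITERAL, lexical, suffix.substring(1), null);
            }
            if (suffix.startsWith("^^<") && suffix.endsWith(">")) {
                String datatype = suffix.substring(3, suffix.length() - 1);
                return new Term(Kind.LITERAL, lexical, null, datatype);
            }
            return new Term(Kind.LITERAL, lexical, null, null);
        }
        return new Term(Kind.IRI, s, null, null);
    }

    public static void main(String[] args) {
        // One example triple as three term strings (subject, predicate, object).
        String[] triple = {
            "http://example.org/s", "http://example.org/p", "\"hello\"@en"
        };
        for (String raw : triple) {
            System.out.println(parseTerm(raw));
        }
    }
}
```

In a real RDFParser, each parsed term would be turned into a Sesame Resource,
URI or Literal, and the resulting Statement handed to the RDFHandler.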
> >
> >> In addition, the HDT RDFParser should be registered with Rio beforehand
> >> under a new RDFFormat, so that the following works:
> >> Rio.createParser(RDFFormat.HDT); // for .hdt files
> >
> > Sesame is set up so that you can add your own formats without having to
> > get a constant added to RDFFormat. Of course, in the long term we will
> > get a constant added for HDT to RDFFormat, but in the short term, you
> > can create your own definition of it locally.
> >
> > Registering the parser is done using META-INF/services/ files that
> > link to RDFParserFactory and RDFWriterFactory classes. See the
> > following examples for RDF/XML:
> >
> >
> > https://bitbucket.org/openrdf/sesame/src/db49126a8cf12c420df57d65deb843707c166651/core/rio/rdfxml/src/main/resources/META-INF/services/?at=master
> >
> > Once you create the META-INF/services files,
> > Rio.createParser(HDTFormat.HDT) should work (as long as you used that
> > constant as the key for the RDFParserFactory, etc.).
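
A rough sketch of why the format constant has to be the key: the parser is
looked up through whichever factory reports that constant as its format. All
names below (Format, Parser, ParserFactory, Registry, HdtParserFactory,
FormatRegistrySketch) are hypothetical stand-ins, not Sesame classes, and the
factory is added by hand here, whereas in Sesame the factories are discovered
through the META-INF/services files described above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a format-keyed parser registry; not the Sesame API.
class FormatRegistrySketch {

    /** Stand-in for a locally defined format constant. */
    static final class Format {
        final String name;
        final String fileExtension;

        Format(String name, String fileExtension) {
            this.name = name;
            this.fileExtension = fileExtension;
        }
    }

    /** Stand-in for a parser; it only reports which format it handles. */
    interface Parser {
        String formatName();
    }

    /** Stand-in for a parser factory: names its format and creates parsers for it. */
    interface ParserFactory {
        Format getFormat();
        Parser getParser();
    }

    /** The local format definition, playing the role of HDTFormat.HDT in the thread. */
    static final Format HDT = new Format("HDT", "hdt");

    static final class HdtParserFactory implements ParserFactory {
        @Override
        public Format getFormat() {
            return HDT; // must be the same constant that callers pass to createParser
        }

        @Override
        public Parser getParser() {
            return () -> HDT.name;
        }
    }

    /** Maps each format constant to the factory that declared it. */
    static final class Registry {
        private final Map<Format, ParserFactory> factories = new HashMap<>();

        void add(ParserFactory factory) {
            factories.put(factory.getFormat(), factory);
        }

        Parser createParser(Format format) {
            ParserFactory factory = factories.get(format);
            if (factory == null) {
                throw new IllegalArgumentException("No parser factory for " + format.name);
            }
            return factory.getParser();
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        registry.add(new HdtParserFactory()); // done by service discovery in Sesame
        System.out.println(registry.createParser(HDT).formatName());
    }
}
```

If the factory reported a different format object than the one passed to
createParser, the lookup in this sketch would fail, which is the situation the
"as long as you used that constant as the key" remark warns about.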
> >
> >> 2) RDFWriter for HDT
> >> As illustrated in [3] (hdt.HDT#saveToHDT), there are 4 steps to writing
> >> HDT: GLOBAL, HEADER, DICTIONARY, and finally TRIPLES. So the first 3
> >> steps go in the HDT RDFWriter.startRDF(), and the last one in the HDT
> >> RDFWriter.handleStatement() (borrowing code from TriplesPrivate.save()).
> >> Nothing needs to be done in endRDF().
> >>
> >> 3) RDFHandler for HDT (not required)
> >> No other RDFHandler is required for HDT. Note that an RDFWriter is
> >> itself an RDFHandler, which is covered by 2). Any other RDFHandler is
> >> outside the scope of this GSoC project, right?
> >
> > Yes, you are correct, once you have an RDFWriter and RDFParser the
> > input/output section will be complete.
> >
> >> 4) Query support for HDT (not required)
> >> Sesame Rio does not include a query component (e.g. SPARQL). Therefore,
> >> this GSoC project will not address the Sesame query part for HDT. Am I
> >> correct?
> >
> > Query support would be done by implementing the Sail interface, which
> > can then be queried using SPARQL by placing the SailRepository wrapper
> > on top of it.
> >
> > One example of a custom extended Sail that you may use as a reference
> > is an interface for the BED format that Jerven Bolleman created,
> > although if it doesn't exactly fit your case, feel free to ask for
> > other advice:
> >
> >
> > https://github.com/JervenBolleman/sparql-bed/tree/master/sparql-bed/src/main/java/ch/isbsib/sparql/bed
> >
> >> Last question: this project seems to relate only to Sesame and RDF HDT.
> >> How does it benefit Marmotta?
> >
> > Marmotta benefits from now supporting the HDT format for both input
> > and output. The RDF community generally picks concrete formats based
> > on the best candidate for a particular task, so HDT may be more
> > suitable than N-Quads for bulk data for some tasks, but N-Quads can be
> > processed in a streaming fashion and can compress relatively well
> > using streaming compression if necessary. By comparison, hand-edited
> > RDF files are generally written in Turtle these days, although there are
> > still quite a few hand-edited RDF/XML files, possibly because there
> > are many examples available for that format.
> >
> > Thanks,
> >
> > Peter
>
