Hello Peter,

The problem is that there is no up-to-date, complete and detailed specification of RDF HDT. The W3C submission [1] from 2011 is out of date. The documentation [2] is newer, but it gives only a general description without much detail. For example, the first few bytes of an RDF HDT file are the "Global ControlInformation", but neither of the two documents describes its layout. The format field of the "Global ControlInformation" should be "<http://purl.org/HDT/hdt#HDTv1>", but that value does not appear in either document.
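As an illustration only, reading such a control information block might look like the sketch below. The layout it assumes (a 4-byte "$HDT" marker, one section type byte, a NUL-terminated format string, a NUL-terminated "key=value;" property list, and a 2-byte CRC16) comes from reading the HDT Java implementation, not from either document, so every field is an assumption. The class name ControlInfoSketch and its members are hypothetical, and the CRC is skipped rather than verified.

```java
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of HDT Java, Sesame or Marmotta.
final class ControlInfoSketch {
    final int type;       // section type byte (value meanings are not assumed here)
    final String format;  // e.g. <http://purl.org/HDT/hdt#HDTv1>
    final Map<String, String> properties = new LinkedHashMap<>();

    private ControlInfoSketch(int type, String format) {
        this.type = type;
        this.format = format;
    }

    // Reads one control information block under the assumed layout.
    static ControlInfoSketch read(InputStream in) throws IOException {
        // Assumed 4-byte marker "$HDT" at the start of every block.
        byte[] magic = new byte[4];
        for (int i = 0; i < magic.length; i++) {
            magic[i] = (byte) readByte(in);
        }
        if (!"$HDT".equals(new String(magic, StandardCharsets.US_ASCII))) {
            throw new IOException("Missing $HDT marker");
        }
        int type = readByte(in);
        String format = readUntilNul(in);
        ControlInfoSketch info = new ControlInfoSketch(type, format);
        // Assumed property syntax: "key=value;key=value;"
        for (String item : readUntilNul(in).split(";")) {
            int eq = item.indexOf('=');
            if (eq > 0) {
                info.properties.put(item.substring(0, eq), item.substring(eq + 1));
            }
        }
        // Assumed 2-byte CRC16 trailer: consumed here, not checked.
        readByte(in);
        readByte(in);
        return info;
    }

    // Reads a single byte, failing on end of stream.
    private static int readByte(InputStream in) throws IOException {
        int b = in.read();
        if (b < 0) {
            throw new EOFException("Unexpected end of stream");
        }
        return b;
    }

    // Collects bytes up to (not including) the next NUL byte.
    private static String readUntilNul(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (int b = readByte(in); b != 0; b = readByte(in)) {
            buf.write(b);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Calling ControlInfoSketch.read(...) on the start of a .hdt file and printing the format field would be one quick way to check whether the file carries the HDTv1 format string mentioned above.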
I have asked the authors of RDF HDT for an up-to-date specification, and I have also asked about the licence issue on legal-discuss@, but no useful reply has arrived so far. In order to code the parser from scratch, I had to study the source code of the HDT Java implementation (LGPL licence), more specifically HDTImpl.java [3]. I then rewrote the code in my own way with the same functionality. For example, ControlInformation in the HDT Java implementation is coded in an object-oriented way, whereas I implemented it with just a few functions/methods, with much of the idea inspired by BinaryRDFParser [4] in Sesame (BSD licence?). However, I did borrow some low-level byte-processing code from the HDT Java implementation.

Is this approach OK with respect to the licence issue?

yours,
Junyue

[1] http://www.w3.org/Submission/2011/SUBM-HDT-20110330/
[2] http://www.rdfhdt.org/hdt-internals/
[3] https://github.com/rdfhdt/hdt-java/blob/master/hdt-java-core/src/main/java/org/rdfhdt/hdt/hdt/impl/HDTImpl.java
[4] http://grepcode.com/file/repo1.maven.org/maven2/org.openrdf.sesame/sesame-rio-binary/2.7.14/org/openrdf/rio/binary/BinaryRDFParser.java/

On Sun, Jun 28, 2015 at 3:34 PM, Peter Ansell <[email protected]> wrote:

> Hi Junyue,
>
> Thanks for the update. See some comments inline below.
>
> On 28 June 2015 at 00:17, Junyue Wang <[email protected]> wrote:
> > Hi Peter, Sergio,
> >
> > I'm here to summarize the status for the first-half part of the GSoC
> > project:
> >
> > 1. Test data preparation
> > It's useful to have test data of hdt files prepared for testing the new
> > parser. But the dataset from [1] are too big for small tests. So I borrowed
> > some examples from W3C RDF documentation [2]. I used HDT java implementation
> > to transform example02.rdf~20.rdf into test02.hdt~20.hdt in the code base
> > [3]
>
> Having small tight examples is vital for unit testing, so that sounds
> good to me, as long as the current spec is backwards compatible with
> it.
>
> > 2. HDT RDF parser based on HDT java implementation
> > I'm sorry that the project goal was misunderstood during the project
> > proposal period. In the first few weeks of the project, I was devoted to
> > code the HDT RDF parser based on HDT java implementation. I also sent email
> > to legal-discuss@, for clarifying the licence issue, but no response showed
> > up until now. Anyway, I committed the code [4], in case it may be useful in
> > future.
>
> We can always rebase that commit out when contributing the final patch
> back, if it is an issue.
>
> > 3. HDT RDF parser from scratch
> > I've began to code the HDT RDF parser from scratch. Now the new parser can
> > parse the Global Information of the hdt files [5]. I'll continue in this way
> > for the next half-part of the project.
>
> That looks like a good start. See how you go after that parsing the
> other two sections and do let us know if you have any issues or
> queries.
>
> Thanks,
>
> Peter
>
> > yours,
> > Junyue
> >
> > [1] http://www.rdfhdt.org/datasets/
> > [2] https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-xml/index.html
> > [3] https://github.com/junyuew/marmotta/tree/MARMOTTA-593/commons/marmotta-sesame-tools/marmotta-rio-rdfhdt/src/test/resources/org/apache/marmotta/commons/sesame/rio/rdfhdt
> > [4] https://github.com/junyuew/marmotta/commit/e4b5d7492f102711c1227f592a36e26353f33812
> > [5] https://github.com/junyuew/marmotta/commit/a7711b8338aafda9d812f0f2bb98cbde53a7cefa
