Re: [Dbpedia-discussion] Pagelinks dataset

2013-12-05 Thread Andrea Di Menna
2013/12/4 Paul Houle ontolo...@gmail.com I think I could get this data out of some API, but there are great HTML 5 parsing libraries now, so a link extractor from HTML can be built as quickly as an API client. There are two big advantages of looking at links in HTML: (i) you can use
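The link-extractor idea above is easy to illustrate with Python's standard-library `html.parser`; the class name and sample HTML below are invented for illustration, not part of any DBpedia code:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href target of every <a> tag seen in the input.

    A real Wikipedia extractor would also filter to article-body links;
    that step is omitted here.
    """
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<p>See <a href="/wiki/RDF">RDF</a> and <a href="/wiki/OWL">OWL</a>.</p>')
print(parser.links)  # ['/wiki/RDF', '/wiki/OWL']
```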

Re: [Dbpedia-discussion] Pagelinks dataset

2013-12-05 Thread Andrea Di Menna
@Paul, unfortunately HTML Wikipedia dumps are not released anymore (they are old static dumps, as you said). This is a problem for a project like DBpedia, as you can easily understand. Moreover, I did not mean that it is not possible to crawl Wikipedia instances or load dumps into a private

Re: [Dbpedia-discussion] Pagelinks dataset

2013-12-05 Thread Paul Houle
The DBpedia Way of extracting the citations probably would be to build something that treats the citations the way infoboxes are treated. It's one way of doing things, and it has its own integrity, but it's not the way I do things. (DBpedia does it this way about as well as it can be done,

[Dbpedia-discussion] parallel rdfDiff

2013-12-05 Thread Paul Houle
I just released a version of Infovore that can do scalable differencing of RDF data sets, producing output in the RDF Patch format (http://afs.github.io/rdf-patch/). The tool is written up at https://github.com/paulhoule/infovore/wiki/rdfDiff. I ran this against two different weeks of Freebase
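For readers unfamiliar with RDF Patch: it represents a diff between two triple sets as rows of operations, `A` for triples added in the newer set and `D` for triples deleted from the older one. The triples below are invented for illustration; see the linked draft for the full syntax:

```
D <http://rdf.freebase.com/ns/m.0abc> <http://rdf.freebase.com/ns/type.object.name> "Old label" .
A <http://rdf.freebase.com/ns/m.0abc> <http://rdf.freebase.com/ns/type.object.name> "New label" .
```

Applying such a patch to the older dump should reproduce the newer one, which is what makes the format suitable for shipping weekly deltas instead of full dumps.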