Hello,

Shug Boabby wrote:
> Hi everybody,
>
> I just came across DBpedia and it looks like a really awesome project.
> I noticed that the latest dump from the English Wikipedia is quite old
> (January 2008; Freebase does a new release every 3 months), so I am
> interested in creating a more up-to-date DBpedia from a recent
> download of the Wikipedia pages and articles (which I have). Is this
> possible?

Yes.

> I am most interested in knowing which Wikipedia pages are people,
> companies, and possible disambiguations. Perhaps if there are URLs of
> the person (or company logo) on Wikipedia, that would be good too.

You could use the YAGO hierarchy for this (which will be much improved
in the next DBpedia release). Using subclass inferencing in Virtuoso,
you can ask the DBpedia SPARQL endpoint for all individuals that are
instances of the class you are looking for, e.g. ask for instances of
http://dbpedia.org/class/yago/Person100007846 if you want to find all
persons. (Note that there are resource limits on the server, so I
believe it will only return 1000 results at a time. You need to ask
several queries to get all persons.)
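Untested, but a query along the following lines should work against the
public endpoint (treat the 1000-row page size as an assumption; the
exact limit is a server setting):

  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

  SELECT ?person
  WHERE {
    ?person rdf:type <http://dbpedia.org/class/yago/Person100007846> .
  }
  LIMIT 1000
  OFFSET 0

Increase OFFSET by the page size (0, 1000, 2000, ...) and repeat until a
request returns fewer rows than the limit. If you also want picture
URLs, adding an OPTIONAL pattern on foaf:depiction should do it,
assuming the image dataset is loaded on the endpoint.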
> I downloaded the SVN from SourceForge, but I'm afraid I am a little
> lost as I was unable to find any documentation. I have a suspicion
> that the entry point is the extraction/extract.php file... but I have
> no idea where to put the bz2 Wikipedia dump file.

You need to import all the dumps into a MySQL database; there is an
import PHP script provided for this. (Also see other threads on this
list.) The extract.php file is the entry point for the complete
extraction process. You may alter it to extract only what you are
looking for. Some developer information about the framework can be
found at http://wiki.dbpedia.org/Documentation. There is,
unfortunately, no good documentation on creating your own DBpedia
release; it is a very time-consuming process.

> Additionally, I noticed that the PersonData preview link is broken.

I fixed it.

> This is disappointing as it is the one I am most interested in (so I
> had to download the full dataset). Is there any reason why this is
> only created from the German data?

There is no strict reason not to include it. I believe the extractor
had problems on the English Wikipedia, which need to be fixed. It
might be included in the next release.

Kind regards,

Jens

--
Dipl. Inf. Jens Lehmann
Department of Computer Science, University of Leipzig
Homepage: http://www.jens-lehmann.org
GPG Key: http://jens-lehmann.org/jens_lehmann.asc
