Hi DBpedia community,
I'm currently writing my master's thesis in the field of DBpedia and
SPARQL. One of my subgoals is to find out how many categories are
present in both Wikipedia and DBpedia. To that end, I wrote a small tool
that identifies all categories having at least one resource in the
non-specific mapping-based part of DBpedia (when I refer to DBpedia in
this mail, I usually mean this part, not the whole dataset). It
scans the file mapping_based_properties_en.nt and checks, for the
subject and the object of each statement, whether it is linked to a
category in the file article_categories_en.nt. If there is a link, the
tool considers the corresponding category to be 'present' in DBpedia.
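For illustration, the check described above could be sketched in Python
roughly as follows. This is not the actual tool, only a minimal sketch under
some assumptions: the files are plain N-Triples with one statement per line,
subjects and predicates are always URIs, and category URIs start with
http://dbpedia.org/resource/Category:. The names parse, resources_in_mappings
and categories_present are made up for this example.

```python
import re

# One N-Triples statement per line; subject and predicate are URIs,
# the object is captured raw (a URI in angle brackets or a literal).
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.*?)\s*\.\s*$')

# Assumption: every category URI starts with this prefix.
CATEGORY_PREFIX = "http://dbpedia.org/resource/Category:"


def parse(lines):
    """Yield (subject, predicate, object_uri_or_None) for each parsable line."""
    for line in lines:
        match = TRIPLE.match(line.strip())
        if not match:
            continue  # skip blank lines, comments and anything unparsable
        subj, pred, obj = match.groups()
        # Literal objects carry no resource URI, so they are reported as None.
        obj_uri = obj[1:-1] if obj.startswith("<") and obj.endswith(">") else None
        yield subj, pred, obj_uri


def resources_in_mappings(mapping_lines):
    """Every URI used as subject or object in the mapping-based file."""
    resources = set()
    for subj, _, obj in parse(mapping_lines):
        resources.add(subj)
        if obj is not None:
            resources.add(obj)
    return resources


def categories_present(mapping_lines, article_category_lines):
    """Categories linked to at least one resource of the mapping-based file."""
    resources = resources_in_mappings(mapping_lines)
    return {
        obj
        for subj, _, obj in parse(article_category_lines)
        if obj is not None
        and obj.startswith(CATEGORY_PREFIX)
        and subj in resources
    }
```

With real data, both arguments could simply be open file objects, for example
open("mapping_based_properties_en.nt", encoding="utf-8"). The set of resources
is held in memory, which may or may not be acceptable for the full dump.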
The same tool also scans the page_links_en.nt file to find all
categories of Wikipedia, that is, all triples which relate a
resource to a category or (if present at all) a category to any object.
According to its description, the 'Page Links Extractor' 'Extracts
internal links between DBpedia instances from the internal pagelinks
between Wikipedia articles.' As Wikipedia pages normally link to their
categories, I assumed that these links are also included and that, thus,
all categories in Wikipedia are captured.
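The scan of page_links_en.nt described above could look roughly like the
following Python sketch. Again, this is only an illustration and not the
actual tool; the function name categories_in_page_links is made up, and the
sketch assumes that category URIs start with
http://dbpedia.org/resource/Category: and that every URI in a line is
enclosed in angle brackets, as in N-Triples.

```python
import re

# Assumption: every category URI starts with this prefix.
CATEGORY_PREFIX = "http://dbpedia.org/resource/Category:"

# Any URI enclosed in angle brackets, regardless of its position in the line.
URI = re.compile(r'<([^>]*)>')


def categories_in_page_links(lines):
    """Collect every category URI that appears anywhere in a statement.

    This covers both cases from the text: a resource linking to a category
    (category as object) and a category linking to something else
    (category as subject).
    """
    categories = set()
    for line in lines:
        for uri in URI.findall(line):
            if uri.startswith(CATEGORY_PREFIX):
                categories.add(uri)
    return categories
```

Because the function only looks at URIs and ignores their position, it does
not distinguish between subject and object; if that distinction matters, the
line would have to be split into its three terms first.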
Unfortunately, this holds for almost all categories, but not all. I found
127 categories which are present in DBpedia but not in Wikipedia, compared
to 59099 categories present in Wikipedia but not in DBpedia. This is
strange, as the set of DBpedia categories must be a subset of the Wikipedia
categories. Otherwise, some magic added new categories during
extraction, and I doubt that. I made sure that it was not my fault and had
a look at the data. One of the categories that suddenly appeared is
http://dbpedia.org/resource/Category:Alaska_elections,_1996. On the
DBpedia side, there is a triple
(<http://dbpedia.org/resource/United_States_Senate_election_in_Alaska,_1996>
<http://purl.org/dc/terms/subject>
<http://dbpedia.org/resource/Category:Alaska_elections,_1996> .) which
relates this category to the United States Senate election in Alaska in
1996. The resource itself is the subject of two statements in
mapping_based_properties_en.nt. On the Wikipedia side, I did not
find any triple in page_links_en.nt which contains the category, but I
did find the resource for the United States Senate election in Alaska in
1996. The corresponding Wikipedia page also includes a link to the
category, and that link has been present since the page was created.
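The comparison and the manual check described above boil down to two set
differences and a search for a URI in a file. A minimal Python sketch,
assuming the two category sets have already been computed; the function names
compare and lines_mentioning are made up for this example:

```python
def compare(dbpedia_categories, wikipedia_categories):
    """Return (only in DBpedia, only in Wikipedia) as two sets."""
    only_dbpedia = dbpedia_categories - wikipedia_categories
    only_wikipedia = wikipedia_categories - dbpedia_categories
    return only_dbpedia, only_wikipedia


def lines_mentioning(uri, lines):
    """Return all statements that contain the given URI as a whole term."""
    needle = "<" + uri + ">"  # match the full term, not a URI prefix
    return [line for line in lines if needle in line]


# The triple quoted above, used as sample input.
sample = [
    "<http://dbpedia.org/resource/United_States_Senate_election_in_Alaska,_1996> "
    "<http://purl.org/dc/terms/subject> "
    "<http://dbpedia.org/resource/Category:Alaska_elections,_1996> ."
]
category = "http://dbpedia.org/resource/Category:Alaska_elections,_1996"
hits = lines_mentioning(category, sample)
```

If the DBpedia categories were a true subset of the Wikipedia categories, the
first set returned by compare would be empty. Running lines_mentioning over
page_links_en.nt with the category URI is the kind of check that came back
empty here.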
What is the reason for this?
* Is my assumption wrong that the internal pagelinks also include links
  to categories?
* If yes:
  * Why were almost all categories captured?
  * Should I use the article_categories_en.nt file for Wikipedia,
    too?
* Did the Page Links Extractor skip the corresponding LinkNode during
  traversal of the AST?
* Does the extraction source lack this information?
I'm looking forward to your answers.
Regards,
Gregor Trefs