Hi DBPedia-Community,

I'm currently writing my master's thesis in the field of DBPedia and SPARQL. One of my subgoals is to find out how many categories are present in both Wikipedia and DBPedia. To that end, I wrote a small tool which identifies all categories that have at least one resource in the unspecific mapping-based part of DBPedia. (When I refer to DBPedia in this mail, I usually mean this part, not the whole of DBPedia.) The tool reads the file mapping_based_properties_en.nt and checks whether the subject and the object of each statement are linked to a category in the file article_categories_en.nt. If there is such a link, the tool considers the corresponding category to be 'present' in DBPedia.
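
Roughly, this step can be sketched in Python as follows. This is only an illustration, not the actual tool: the function names and the regular expressions are made up, the parsing handles only simple one-line N-Triples statements, and I assume here that a category counts as 'present' if either the subject or the object of a statement is linked to it.

```python
import re
from collections import defaultdict

# Matches an N-Triples line whose subject, predicate and object are all IRIs.
IRI_TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$')
# Matches subject and predicate IRIs; the object may be an IRI or a literal.
SUBJ_PRED = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.*?)\s*\.\s*$')


def load_article_categories(lines):
    """Map each resource IRI to the set of category IRIs it is linked to."""
    cats = defaultdict(set)
    for line in lines:
        m = IRI_TRIPLE.match(line)
        if m:
            cats[m.group(1)].add(m.group(3))
    return cats


def present_categories(mapping_lines, cats):
    """Collect the categories of every subject and every IRI object."""
    present = set()
    for line in mapping_lines:
        m = SUBJ_PRED.match(line)
        if not m:
            continue  # comment, blank line, or a statement this sketch cannot parse
        subj, _pred, obj = m.groups()
        present |= cats.get(subj, set())
        # Literals start with a double quote, so only IRI objects are looked up.
        if obj.startswith('<') and obj.endswith('>'):
            present |= cats.get(obj[1:-1], set())
    return present
```

In practice both functions would be fed open file handles for article_categories_en.nt and mapping_based_properties_en.nt, since iterating over a file yields its lines.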

On the other hand, the same tool searches the page_links_en.nt file to find all categories of Wikipedia, that is, all triples which relate a resource to a category or (if present at all) a category to any object. According to its description, the 'Page Links Extractor' 'Extracts internal links between DBpedia instances from the internal pagelinks between Wikipedia articles.' As Wikipedia pages normally link to their categories, I assumed that these links are included as well and that, consequently, all categories in Wikipedia are captured.
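
As a sketch of this second step (again illustrative Python, not the actual tool; the function name is made up, and I assume categories can be recognised by the IRI prefix http://dbpedia.org/resource/Category: as in the example further below):

```python
import re

# Assumption: category resources are recognised by this IRI prefix.
CATEGORY_PREFIX = 'http://dbpedia.org/resource/Category:'

# Matches an N-Triples line whose subject, predicate and object are all IRIs.
IRI_TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$')


def categories_in_page_links(lines):
    """Collect every category IRI that occurs as subject or object of a triple."""
    found = set()
    for line in lines:
        m = IRI_TRIPLE.match(line)
        if not m:
            continue  # skip comments and anything that is not an IRI-only triple
        for iri in (m.group(1), m.group(3)):
            if iri.startswith(CATEGORY_PREFIX):
                found.add(iri)
    return found
```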

Unfortunately, this holds for almost all categories, but not for all of them. I found 127 categories which are present in DBPedia but not in Wikipedia, compared to 59099 categories which are present in Wikipedia but not in DBPedia. This is strange, as the set of DBPedia categories must be a subset of the Wikipedia categories. Otherwise, some magic would have added new categories during extraction, and I doubt that.

I made sure that this was not a fault in my tool and had a look at the data. One of the categories that suddenly appeared is http://dbpedia.org/resource/Category:Alaska_elections,_1996. On the DBPedia side, there is a triple (<http://dbpedia.org/resource/United_States_Senate_election_in_Alaska,_1996> <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:Alaska_elections,_1996> .) which relates this category to the United States Senate election in Alaska in 1996. The resource itself is the subject of two statements in mapping_based_properties_en.nt. On the Wikipedia side, I did not find any triple in page_links_en.nt which contains the category, although I did find the resource for the United States Senate election in Alaska in 1996. The corresponding Wikipedia page also includes a link to the category, and that link has been present since the page was created.
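
For completeness, the comparison behind the two counts above amounts to two set differences. A minimal Python sketch, with a made-up function name, assuming both category sets have already been collected as sets of IRIs:

```python
def compare_category_sets(dbpedia_cats, wikipedia_cats):
    """Return (only in DBPedia, only in Wikipedia) for two sets of category IRIs."""
    only_dbpedia = dbpedia_cats - wikipedia_cats
    only_wikipedia = wikipedia_cats - dbpedia_cats
    return only_dbpedia, only_wikipedia


def is_consistent(dbpedia_cats, wikipedia_cats):
    """True if every DBPedia category also occurs among the Wikipedia categories."""
    return dbpedia_cats <= wikipedia_cats
```

If my assumption about the page links were correct, the first difference would be empty and is_consistent would return True. With my data the first difference contains 127 categories.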

What is the reason for this?
* Is my assumption wrong that the internal pagelinks also include links to categories?
    * If it is wrong
        * Why were almost all categories captured anyway?
* Should I use the article_categories_en.nt file for Wikipedia, too?
* Did the Pagelinks Extractor skip the corresponding LinkNode during traversal of the AST?
* Does the extraction source miss this information?

I'm looking forward to your answers.

Regards,
Gregor Trefs