I've been looking at the Wikipedia data and noticed the following issue. There seem to be categories in articlecategories_en that don't exist in categories_label_en, for instance

    http://dbpedia.org/resource/Category:The_Like_Young_albums

If I look in the label file,

    $ bzcat ~/dbpedia3.2/categories_label_en.nt.bz2 | grep The_Like_Young

I only find

    <http://dbpedia.org/resource/Category:The_Like_Young_songs> <http://www.w3.org/2000/01/rdf-schema#label> "The Like Young songs"@en .

which doesn't match. I found about 31,695 cases like this. I could either ignore these categories or make up labels for them from the URLs, but it may point to a deeper problem.

I'm also thinking about containment relationships between categories. If I look at Wikipedia, I find pages like

    http://en.wikipedia.org/wiki/Category:Chemistry

Note that Chemistry contains subcategories such as

    http://en.wikipedia.org/wiki/Category:Acid-base_chemistry

Perhaps I'm missing something, but I don't see subcategory relationships kept track of in the DBpedia data.

I know that Wikipedia categories are pretty messy, but I've found that graph traversals and filtering can be applied to them to find members of classes that slip through the cracks of more rigorous taxonomies. I used methods like that in the construction of http://carpictures.cc/

Are there any plans to improve category parsing in future DBpedia versions?
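
For anyone who wants to reproduce the missing-label count, here is a minimal sketch in Python. It is not the method used above, only one way the comparison could be done. It assumes that each line of the article-category file carries the category URI as the object of the triple, and that each line of the label file carries it as the subject (as in the label line quoted above). All function names are made up for illustration, and the articlecategories file name in the usage note is assumed by analogy with the label file.

```python
import bz2
import re
from urllib.parse import unquote

# Splits a simple N-Triples line into <subject>, <predicate>, and the rest.
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.*?)\s*\.\s*$')

def labelled_categories(lines):
    """Collect the subjects of the label file (categories that have a label)."""
    cats = set()
    for line in lines:
        m = TRIPLE.match(line)
        if m:
            cats.add(m.group(1))
    return cats

def used_categories(lines):
    """Collect URI objects of the article-category file (assumed layout)."""
    cats = set()
    for line in lines:
        m = TRIPLE.match(line)
        if m:
            obj = m.group(3)
            if obj.startswith('<') and obj.endswith('>'):
                cats.add(obj[1:-1])
    return cats

def missing_labels(article_lines, label_lines):
    """Categories assigned to articles that have no entry in the label file."""
    return used_categories(article_lines) - labelled_categories(label_lines)

def label_from_uri(uri):
    """Make up a label from the URI: take the part after 'Category:'."""
    name = uri.rsplit('Category:', 1)[-1]
    return unquote(name).replace('_', ' ')

def open_nt(path):
    """Open a bzip2-compressed N-Triples dump as text."""
    return bz2.open(path, 'rt', encoding='utf-8')
```

With the dumps in ~/dbpedia3.2/, one could pass open_nt() of the articlecategories_en.nt.bz2 and categories_label_en.nt.bz2 files to missing_labels() and take len() of the result. Both sets are held in memory, which should be acceptable for category URIs but is worth keeping in mind.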
