Hi everyone,

Is the following statement from the DBpedia homepage correct?
"The DBpedia knowledge base currently describes more than 2.6 million things, 
including at least 213,000 persons, 328,000 places..."
If so, it means that only slightly more than 213,000 persons are identified as 
such in DBpedia. 
I am working on extracting person names from Wikipedia and found over 500,000 
of them (500,679 to be exact) using the following simple method: 
1. build a list of occupations (terms like accountant, actor, actress, actuary) 
from http://en.wikipedia.org/wiki/List_of_occupations and a list of 
nationalities (terms like afghani, albanian, algerian) from 
http://en.wikipedia.org/wiki/List_of_nationalities
2. go through the list of Wikipedia articles and consider that the article is 
about a person if its categories contain at least one term from the list of 
occupations AND at least one from the list of nationalities. 
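
As a rough sketch of the two steps above (not the exact code that produced 
the numbers): the term sets below are tiny placeholders for the full lists, 
and the function names and the naive plural stripping are assumptions made 
only for illustration. 

```python
import re

# Placeholder term lists; the full lists would come from the two
# Wikipedia pages named in step 1.
OCCUPATIONS = {"accountant", "actor", "actress", "actuary"}
NATIONALITIES = {"afghani", "albanian", "algerian"}


def category_terms(categories):
    """Collect lowercase word tokens from all category names.

    A trailing "s" is also stripped (assumption) so that a category such as
    "Albanian film actors" yields the occupation term "actor".
    """
    terms = set()
    for category in categories:
        for word in re.findall(r"[a-z]+", category.lower()):
            terms.add(word)
            if word.endswith("s"):
                terms.add(word[:-1])
    return terms


def is_person(categories, occupations=OCCUPATIONS, nationalities=NATIONALITIES):
    """Step 2: require at least one occupation AND one nationality term."""
    terms = category_terms(categories)
    return bool(terms & occupations) and bool(terms & nationalities)
```

Dropping the nationality test from is_person gives the occupations-only 
variant, which matches more articles but is more likely to admit non-persons. 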
From my initial observations, the results are quite accurate (I would estimate 
precision at around 98-99%). If I use only the list of occupations, the number 
of articles found rises to 599,595, but there are probably more errors than 
when both lists are used. I think the method can be tuned to increase recall by 
adding supplementary patterns taken from the first sentence of the article 
(birth years, or the period in which the person lived).
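
One possible form such a first-sentence pattern could take is sketched below. 
This is only an assumption about how the idea might be implemented: the 
function name, the regular expressions and the restriction to years 1000-2099 
are all hypothetical. 

```python
import re

# First parenthetical in the sentence, e.g. "(born 1950)".
PAREN = re.compile(r"\(([^()]*)\)")
# Four-digit years from 1000 to 2099 (assumption: earlier dates are ignored).
YEAR = re.compile(r"\b(1\d{3}|20\d{2})\b")


def has_life_dates(first_sentence):
    """Return True if the first parenthetical looks like birth/death dates."""
    match = PAREN.search(first_sentence)
    if not match:
        return False
    inside = match.group(1)
    years = YEAR.findall(inside)
    # Pattern 1: "(born ... 1950)"
    if "born" in inside.lower() and years:
        return True
    # Pattern 2: two years separated by a hyphen or en dash
    return len(years) >= 2 and re.search(r"[-–]", inside) is not None
```

Used alone, a pattern like this would also match events with a date range, so 
it is probably better combined with the category test than used to replace it. 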
Should you be interested, I can provide sample results so that you can check 
them. 

Adrian Popescu




________________________________

I'm looking at my sample some more.  Here's the distribution of 
top-level types from the DBpedia ontology:

+-----------------------------------+----------+
| type                              | count(*) |
+-----------------------------------+----------+
| SupremeCourtOfTheUnitedStatesCase |        3 |
| Website                           |        4 |
| Event                             |       21 |
| Infrastructure                    |       47 |
| Work                              |      525 |
| Organisation                      |      649 |
| Place                             |      712 |
| Person                            |     2208 |
| NULL                              |     6961 |
+-----------------------------------+----------+
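
For anyone who wants to check the arithmetic, the shares can be recomputed 
from the counts in the table. This is just a small sketch; the variable 
names are arbitrary, and None stands in for the NULL row. 

```python
# Counts copied from the table above; None represents the NULL (untyped) row.
counts = {
    "SupremeCourtOfTheUnitedStatesCase": 3,
    "Website": 4,
    "Event": 21,
    "Infrastructure": 47,
    "Work": 525,
    "Organisation": 649,
    "Place": 712,
    "Person": 2208,
    None: 6961,
}

total = sum(counts.values())
untyped = counts[None]
typed = total - untyped

untyped_share = untyped / total                  # fraction with no type
person_share_of_typed = counts["Person"] / typed  # Persons among typed objects

print(f"total objects: {total}")
print(f"untyped: {untyped_share:.1%}")
print(f"Person share of typed objects: {person_share_of_typed:.1%}")
```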


    6961 out of 11130 objects are untyped, or about 62.5%.  Looking at 
the actual untyped objects, my rough guess is that 80-90% of them 
could be assigned types in the DBpedia ontology, such as

http://en.wikipedia.org/wiki/Joseph_Pulitzer

    I'm sure the Metaweb people will gloat that he's typed in Freebase.

http://www.freebase.com/view/en/joseph_pulitzer

    Maybe half of the untyped items I see are People, but I also see some 
Works, Places, etc.

    I'm going to line these up with Freebase types and see what happens.



      
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
