Hi everyone,
Is the following statement on the DBpedia homepage correct?
"The DBpedia knowledge base currently describes more than 2.6 million things,
including at least 213,000 persons, 328,000 places..."
If so, it means that only a little over 213,000 persons are identified as such
in DBpedia.
I am working on extracting person names from Wikipedia and found over 500,000
person names (500,679 to be exact) using the following simple method:
1. build a list of occupations (terms like accountant, actor, actress, actuary)
from http://en.wikipedia.org/wiki/List_of_occupations and a list of
nationalities (terms like afghani, albanian, algerian) from
http://en.wikipedia.org/wiki/List_of_nationalities
2. go through the list of Wikipedia articles and treat an article as being
about a person if its categories contain at least one term from the list of
occupations AND at least one term from the list of nationalities.
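The two steps above can be sketched roughly as follows. This is only a minimal illustration, not the actual extraction code: the function names (tokens_of, is_person), the word-level matching and the naive plural handling are assumptions, and the term lists are tiny stand-ins for the ones built from the two Wikipedia list pages.

```python
import re

def tokens_of(category):
    """Return lowercase word tokens of a category name, plus naive singular forms."""
    words = re.findall(r"[a-z]+", category.lower())
    tokens = set(words)
    # Assumption: strip a trailing "s" so that "actors" also matches "actor".
    tokens.update(w[:-1] for w in words if w.endswith("s") and len(w) > 1)
    return tokens

def is_person(categories, occupations, nationalities):
    """True if the categories mention at least one occupation AND one nationality."""
    tokens = set()
    for category in categories:
        tokens |= tokens_of(category)
    return bool(tokens & occupations) and bool(tokens & nationalities)

# Tiny illustrative term lists (stand-ins for the full lists).
occupations = {"accountant", "actor", "actress", "actuary"}
nationalities = {"afghani", "albanian", "algerian"}

print(is_person(["Albanian actors", "1950 births"], occupations, nationalities))
print(is_person(["Cities in Albania"], occupations, nationalities))
```

Multi-word occupations would need phrase matching rather than single tokens, which this sketch does not attempt.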
From my initial observations, the results are quite accurate (I would estimate
precision at around 98-99%). If I use only the list of occupations, the number
of matching articles rises to 599,595, but there are probably more errors than
when using both lists. I think the method can be tuned to increase recall by
using supplementary patterns from the first sentence of the article (birth
years or the period when the person lived).
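As a rough idea of what such first-sentence patterns could look like, here is a hypothetical sketch. The two regular expressions and the function name has_life_dates are assumptions for illustration, not something I have evaluated. They look for "(born ... 1970" or a parenthesised year range at the start of an article.

```python
import re

# Assumption: a living person's article often opens with "(born <date>".
BORN = re.compile(r"\(\s*born\b[^()]*\b\d{4}\b", re.IGNORECASE)
# Assumption: otherwise a parenthesised "<year> - <year>" range (hyphen or en dash).
LIFESPAN = re.compile(
    r"\([^()]*\b\d{3,4}\b[^()]*[-\u2013][^()]*\b\d{3,4}\b[^()]*\)"
)

def has_life_dates(first_sentence):
    """True if the sentence contains a birth year or a year range in parentheses."""
    return bool(BORN.search(first_sentence) or LIFESPAN.search(first_sentence))

print(has_life_dates("Joseph Pulitzer (April 10, 1847 \u2013 October 29, 1911) was a publisher."))
print(has_life_dates("Paris is the capital of France."))
```

A year range alone also matches events and other dated things, so a signal like this would probably need to be combined with the category test rather than used on its own.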
Should you be interested, I can provide sample results for you to check.
Adrian Popescu
________________________________
I'm looking at my sample some more. Here's the distribution of
top-level types from the DBpedia ontology:
+-----------------------------------+----------+
| type                              | count(*) |
+-----------------------------------+----------+
| SupremeCourtOfTheUnitedStatesCase |        3 |
| Website                           |        4 |
| Event                             |       21 |
| Infrastructure                    |       47 |
| Work                              |      525 |
| Organisation                      |      649 |
| Place                             |      712 |
| Person                            |     2208 |
| NULL                              |     6961 |
+-----------------------------------+----------+
6961 out of 11130 objects are untyped, or about 62%. Looking at
the actual untyped objects, my rough guess is that 80-90% of them
could be assigned types in the DBpedia ontology. One example is
http://en.wikipedia.org/wiki/Joseph_Pulitzer
I'm sure the Metaweb people will gloat that he's typed in Freebase.
http://www.freebase.com/view/en/joseph_pulitzer
Maybe half of the untyped items I see are Persons, but I also see some
Works, Places, etc.
I'm going to line these up with Freebase types and see what happens.
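For anyone who wants to re-check the arithmetic, the counts from the table can be summed in a few lines. The dictionary below just restates the table, with None standing in for the NULL (untyped) row; the variable names are only for illustration.

```python
# Counts copied from the type distribution table; None stands for NULL (untyped).
counts = {
    "SupremeCourtOfTheUnitedStatesCase": 3,
    "Website": 4,
    "Event": 21,
    "Infrastructure": 47,
    "Work": 525,
    "Organisation": 649,
    "Place": 712,
    "Person": 2208,
    None: 6961,
}

total = sum(counts.values())
untyped = counts[None]
untyped_share = untyped / total  # fraction of sampled objects with no type

print(total, untyped, f"{untyped_share:.1%}")
```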
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion