Paul Houle wrote:
> I'm looking at my sample some more. Here's the distribution of
> toplevel types from the dbpedia ontology
>
> +-----------------------------------+----------+
> | type | count(*) |
> +-----------------------------------+----------+
> | SupremeCourtOfTheUnitedStatesCase | 3 |
> | Website | 4 |
> | Event | 21 |
> | Infrastructure | 47 |
> | Work | 525 |
> | Organisation | 649 |
> | Place | 712 |
> | Person | 2208 |
> | NULL | 6961 |
> +-----------------------------------+----------+
>
>
I used the new simplified dump from Metaweb to do the same thing
with Freebase. Lacking a proper schema dump, I simply assumed that the
toplevel type of a topic is the most prevalent type (other than
/common/topic) that applies to it:
+---------------------------------------------------+----------+
| url | count(*) |
+---------------------------------------------------+----------+
| /people/person | 4066 |
| NULL | 3756 |
| /location/location | 1211 |
| /business/employer | 827 |
| /film/film | 427 |
| /projects/project_focus | 268 |
| /time/event | 46 |
| /organization/organization | 46 |
| /transportation/road | 44 |
| /architecture/museum | 41 |
| /broadcast/broadcast | 40 |
| /music/artist | 33 |
| /time/recurring_event | 30 |
| /music/album | 27 |
| /book/written_work | 25 |
| /book/periodical | 22 |
| /education/educational_institution | 14 |
| /base/dance/topic | 12 |
| /business/business_location | 11 |
| /tv/tv_program | 11 |
| /sports/sports_team | 9 |
| /boats/ship | 9 |
| /metropolitan_transit/transit_line | 7 |
| /base/amusementparks/topic | 7 |
| /business/company | 7 |
| /book/author | 6 |
| /visual_art/artwork | 5 |
| /user/robert/area_codes/topic | 5 |
| /book/book_subject | 5 |
| /food/dish | 4 |
| /architecture/structure | 4 |
| /transportation/bridge | 4 |
| /business/shopping_center | 4 |
| /sports/sports_facility | 3 |
| /film/film_location | 3 |
| /medicine/hospital | 3 |
| /music/genre | 3 |
| /award/award | 3 |
| /music/composition | 3 |
| /award/award_winner | 3 |
| /protected_sites/protected_site | 3 |
| /award/award_category | 2 |
| /government/government_agency | 2 |
| /tv/tv_network | 2 |
| /base/disaster2/topic | 2 |
| /user/skud/legal/topic | 2 |
| /education/school | 2 |
| /internet/website | 2 |
| /base/dance/dance_company | 2 |
| /government/governmental_body | 2 |
| /architecture/landscape_project | 2 |
| /biology/organism | 2 |
| /geography/body_of_water | 2 |
| /theater/theater_company | 2 |
| /book/school_or_movement | 2 |
| /user/skud/names/namesake | 2 |
| /military/armed_force | 1 |
| /projects/project | 1 |
| /user/iubookgirl/default_domain/academic_library | 1 |
| /geography/island | 1 |
| /influence/influence_node | 1 |
| /base/fblinux/topic | 1 |
| /film/writer | 1 |
| /user/rcheramy/default_domain/nickname | 1 |
| /award/award_presenting_organization | 1 |
| /architecture/unrealized_design | 1 |
| /base/americancomedy/comedy_venue | 1 |
| /base/collectives/topic | 1 |
| /games/game | 1 |
| /broadcast/radio_station | 1 |
| /cvg/cvg_developer | 1 |
| /base/omgfun/festival_series | 1 |
| /award/award_nominee | 1 |
| /user/petroleumj/default_domain/subway_station | 1 |
| /business/job_title | 1 |
| /user/skud/flags/topic | 1 |
| /visual_art/art_subject | 1 |
| /user/tsegaran/random/topic | 1 |
| /book/magazine | 1 |
| /user/techgnostic/default_domain/periodical | 1 |
| /food/brewery_brand_of_beer | 1 |
| /geography/bay | 1 |
| /metropolitan_transit/transit_system | 1 |
| /internet/website_owner | 1 |
| /visual_art/art_owner | 1 |
| /computer/software_developer | 1 |
| /fictional_universe/fictional_character_creator | 1 |
| /venture_capital/venture_investor | 1 |
| /base/omgfun/topic | 1 |
| /award/hall_of_fame | 1 |
| /base/exhibitions/topic | 1 |
| /base/symbols/topic | 1 |
| /architecture/architectural_structure_owner | 1 |
| /aviation/airliner_accident | 1 |
| /guid/9202a8c04000641f800000000af896ba | 1 |
| /user/guidewire/default_domain/online_music_store | 1 |
| /library/public_library_system | 1 |
| /user/gogza/default_domain/recurring_event | 1 |
| /base/americancomedy/topic | 1 |
+---------------------------------------------------+----------+
(Note that this is over a list of about 11k topics whose classification
I'm working to improve before I feed it into the next stage of my
production pipeline.)
Freebase types about twice as many people as dbpedia (4066 vs. 2208) and
leaves about half as many topics untyped (3756 vs. 6961). The Freebase
"toplevels" I'm generating are completely uncontrolled, so you get some
strange ones towards the bottom. Still, the "prevalence" filter has
gotten rid of a large number of references to certain common junk types,
such as the "Jungle" type that you find all over the place in Freebase.
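For anyone who wants to reproduce the counts, here is a rough Python sketch of the prevalence heuristic described above. It is not the code I actually ran: the dict-of-lists input and the function names are made up for illustration, and it assumes "most prevalent" means "carried by the most topics in the sample".

```python
from collections import Counter

# Types excluded from consideration, as in the post.
IGNORED = {"/common/topic"}


def toplevel_types(topic_types):
    """Map each topic to its most prevalent type, or None if it has none.

    topic_types: dict of topic id -> iterable of Freebase type ids
    (a hypothetical in-memory form of the dump).
    """
    # Prevalence = number of topics in the sample that carry each type.
    prevalence = Counter()
    for types in topic_types.values():
        prevalence.update(set(types) - IGNORED)

    toplevels = {}
    for topic, types in topic_types.items():
        candidates = set(types) - IGNORED
        # Ties are broken by type id so the result is deterministic.
        toplevels[topic] = (
            max(candidates, key=lambda t: (prevalence[t], t)) if candidates else None
        )
    return toplevels


def distribution(toplevels):
    """Count topics per toplevel type, most frequent first."""
    return Counter(toplevels.values()).most_common()
```

A topic whose only type is /common/topic comes out as None, which corresponds to the NULL row in the table. Rare junk types lose to any more widespread type on the same topic, which is why they mostly disappear.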
Note that the URL structure of "commons" types on FB tends to be
{problem_domain}/{type}
so you tend to see things like "/book/author" where there is no
inheritance relation between book and author. You also see "/base/..."
types and "/user/..." types which represent namespaces inside FB.
I'm going to look at the double-untyped a bit more and also merge the fb
types into the dbpedia toplevels.
_______________________________________________
Dbpedia-discussion mailing list
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion