Paul Houle wrote:
>       I'm looking at my sample some more.  Here's the distribution of 
> toplevel types from the dbpedia ontology
>
> +-----------------------------------+----------+
> | type                              | count(*) |
> +-----------------------------------+----------+
> | SupremeCourtOfTheUnitedStatesCase |        3 |
> | Website                           |        4 |
> | Event                             |       21 |
> | Infrastructure                    |       47 |
> | Work                              |      525 |
> | Organisation                      |      649 |
> | Place                             |      712 |
> | Person                            |     2208 |
> | NULL                              |     6961 |
> +-----------------------------------+----------+
>
>   
    I used the new simplified dump from Metaweb to do the same thing 
with Freebase.  Lacking a proper schema dump,  I simply assumed that the 
toplevel type of a topic is the most prevalent type (other than 
/common/topic) that applies to it:

+---------------------------------------------------+----------+
| url                                               | count(*) |
+---------------------------------------------------+----------+
| /people/person                                    |     4066 |
| NULL                                              |     3756 |
| /location/location                                |     1211 |
| /business/employer                                |      827 |
| /film/film                                        |      427 |
| /projects/project_focus                           |      268 |
| /time/event                                       |       46 |
| /organization/organization                        |       46 |
| /transportation/road                              |       44 |
| /architecture/museum                              |       41 |
| /broadcast/broadcast                              |       40 |
| /music/artist                                     |       33 |
| /time/recurring_event                             |       30 |
| /music/album                                      |       27 |
| /book/written_work                                |       25 |
| /book/periodical                                  |       22 |
| /education/educational_institution                |       14 |
| /base/dance/topic                                 |       12 |
| /business/business_location                       |       11 |
| /tv/tv_program                                    |       11 |
| /sports/sports_team                               |        9 |
| /boats/ship                                       |        9 |
| /metropolitan_transit/transit_line                |        7 |
| /base/amusementparks/topic                        |        7 |
| /business/company                                 |        7 |
| /book/author                                      |        6 |
| /visual_art/artwork                               |        5 |
| /user/robert/area_codes/topic                     |        5 |
| /book/book_subject                                |        5 |
| /food/dish                                        |        4 |
| /architecture/structure                           |        4 |
| /transportation/bridge                            |        4 |
| /business/shopping_center                         |        4 |
| /sports/sports_facility                           |        3 |
| /film/film_location                               |        3 |
| /medicine/hospital                                |        3 |
| /music/genre                                      |        3 |
| /award/award                                      |        3 |
| /music/composition                                |        3 |
| /award/award_winner                               |        3 |
| /protected_sites/protected_site                   |        3 |
| /award/award_category                             |        2 |
| /government/government_agency                     |        2 |
| /tv/tv_network                                    |        2 |
| /base/disaster2/topic                             |        2 |
| /user/skud/legal/topic                            |        2 |
| /education/school                                 |        2 |
| /internet/website                                 |        2 |
| /base/dance/dance_company                         |        2 |
| /government/governmental_body                     |        2 |
| /architecture/landscape_project                   |        2 |
| /biology/organism                                 |        2 |
| /geography/body_of_water                          |        2 |
| /theater/theater_company                          |        2 |
| /book/school_or_movement                          |        2 |
| /user/skud/names/namesake                         |        2 |
| /military/armed_force                             |        1 |
| /projects/project                                 |        1 |
| /user/iubookgirl/default_domain/academic_library  |        1 |
| /geography/island                                 |        1 |
| /influence/influence_node                         |        1 |
| /base/fblinux/topic                               |        1 |
| /film/writer                                      |        1 |
| /user/rcheramy/default_domain/nickname            |        1 |
| /award/award_presenting_organization              |        1 |
| /architecture/unrealized_design                   |        1 |
| /base/americancomedy/comedy_venue                 |        1 |
| /base/collectives/topic                           |        1 |
| /games/game                                       |        1 |
| /broadcast/radio_station                          |        1 |
| /cvg/cvg_developer                                |        1 |
| /base/omgfun/festival_series                      |        1 |
| /award/award_nominee                              |        1 |
| /user/petroleumj/default_domain/subway_station    |        1 |
| /business/job_title                               |        1 |
| /user/skud/flags/topic                            |        1 |
| /visual_art/art_subject                           |        1 |
| /user/tsegaran/random/topic                       |        1 |
| /book/magazine                                    |        1 |
| /user/techgnostic/default_domain/periodical       |        1 |
| /food/brewery_brand_of_beer                       |        1 |
| /geography/bay                                    |        1 |
| /metropolitan_transit/transit_system              |        1 |
| /internet/website_owner                           |        1 |
| /visual_art/art_owner                             |        1 |
| /computer/software_developer                      |        1 |
| /fictional_universe/fictional_character_creator   |        1 |
| /venture_capital/venture_investor                 |        1 |
| /base/omgfun/topic                                |        1 |
| /award/hall_of_fame                               |        1 |
| /base/exhibitions/topic                           |        1 |
| /base/symbols/topic                               |        1 |
| /architecture/architectural_structure_owner       |        1 |
| /aviation/airliner_accident                       |        1 |
| /guid/9202a8c04000641f800000000af896ba            |        1 |
| /user/guidewire/default_domain/online_music_store |        1 |
| /library/public_library_system                    |        1 |
| /user/gogza/default_domain/recurring_event        |        1 |
| /base/americancomedy/topic                        |        1 |
+---------------------------------------------------+----------+


(Note that this is over a list of about 11k topics whose classification 
I'm working to improve before I feed it into the next stage of my 
production pipeline.)

Freebase has types for about twice as many people as DBpedia,  and leaves 
about half as many topics untyped.  The Freebase "toplevels" I'm 
generating are completely uncontrolled,  so you get some strange ones 
towards the bottom.  The "prevalence" filter has gotten rid of a large 
number of references to certain common junk types,  such as the "Jungle" 
type that you find all over the place in Freebase.
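
As a rough illustration of the prevalence heuristic described above, here 
is a minimal Python sketch.  It is not the code I actually ran:  the 
function name, the input shape (a dict from topic id to a list of type 
ids) and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter

COMMON_TOPIC = "/common/topic"  # excluded, since nearly every topic has it


def toplevel_types(topic_types):
    """Map each topic to its most prevalent type, or None if it has none.

    topic_types: dict of topic id -> list of type ids (assumed input shape).
    """
    # Prevalence = number of topics in the sample that carry each type.
    prevalence = Counter(
        t
        for types in topic_types.values()
        for t in set(types)
        if t != COMMON_TOPIC
    )
    result = {}
    for topic, types in topic_types.items():
        candidates = [t for t in set(types) if t != COMMON_TOPIC]
        if not candidates:
            result[topic] = None  # shows up as NULL in the table
            continue
        # Highest prevalence wins; ties go to the alphabetically first id
        # (an arbitrary choice, only there to make the result deterministic).
        result[topic] = min(candidates, key=lambda t: (-prevalence[t], t))
    return result


# Made-up sample data, for illustration only.
sample = {
    "topic_a": ["/common/topic", "/people/person", "/book/author"],
    "topic_b": ["/common/topic", "/people/person"],
    "topic_c": ["/common/topic"],
}
toplevels = toplevel_types(sample)
```

A type that appears on only a handful of topics loses to a widely used 
type whenever both apply to the same topic,  which is why rare types 
survive only on topics that have nothing more common attached.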

Note that the URL structure of "commons" types on Freebase tends to be

/{problem_domain}/{type}

so you tend to see things like "/book/author" where there is no 
inheritance relation between book and author.  You also see "/base/..." 
types and "/user/..." types,  which represent namespaces inside Freebase.
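
To make the id layout concrete,  here is a small Python sketch that splits 
a type id into its parts.  The function name,  the returned field names 
and the segment counts assumed for "/base/..." and "/user/..." ids are my 
own guesses based on the ids in the table above,  not an official 
Freebase rule.

```python
def parse_type_id(type_id):
    """Split a Freebase-style type id into namespace, owner, domain and type.

    Layouts assumed here (inferred from the sample, not from a spec):
      /{domain}/{type}                -> "commons"
      /base/{domain}/{type}           -> "base"
      /user/{owner}/{domain}/{type}   -> "user"
    Anything else (e.g. /guid/...) is reported as "other".
    """
    parts = type_id.strip("/").split("/")
    if parts[0] == "user" and len(parts) == 4:
        return {"namespace": "user", "owner": parts[1],
                "domain": parts[2], "type": parts[3]}
    if parts[0] == "base" and len(parts) == 3:
        return {"namespace": "base", "owner": None,
                "domain": parts[1], "type": parts[2]}
    if len(parts) == 2 and parts[0] != "guid":
        return {"namespace": "commons", "owner": None,
                "domain": parts[0], "type": parts[1]}
    return {"namespace": "other", "owner": None, "domain": None, "type": None}


author = parse_type_id("/book/author")
dance = parse_type_id("/base/dance/topic")
area_codes = parse_type_id("/user/robert/area_codes/topic")
```

Note that the "domain" part only says where a type lives;  as mentioned 
above,  it says nothing about inheritance between types.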

I'm going to look a bit more at the topics that are untyped in both 
datasets,  and also merge the Freebase types into the DBpedia toplevels.
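
One possible shape for that merge,  sketched in Python.  This is only a 
guess at an approach:  the mapping table below is hypothetical and 
deliberately tiny,  and the rule "prefer the DBpedia type,  fall back to 
the mapped Freebase type" is an assumption,  not something I have settled 
on.

```python
# Hypothetical, hand-written mapping from Freebase type ids to DBpedia
# toplevel types. Illustrative only; a real table would need review.
FB_TO_DBPEDIA = {
    "/people/person": "Person",
    "/location/location": "Place",
    "/film/film": "Work",
    "/time/event": "Event",
    "/organization/organization": "Organisation",
}


def merged_toplevel(dbpedia_type, freebase_type):
    """Return a DBpedia toplevel type for a topic, or None if still untyped.

    Assumed rule: keep the DBpedia type when there is one, otherwise use
    the mapped Freebase type if the mapping knows about it.
    """
    if dbpedia_type is not None:
        return dbpedia_type
    return FB_TO_DBPEDIA.get(freebase_type)


from_freebase = merged_toplevel(None, "/people/person")
from_dbpedia = merged_toplevel("Place", "/people/person")
still_untyped = merged_toplevel(None, None)
```

Topics for which this returns None are the ones untyped on both sides,  or 
typed in Freebase with something the mapping does not cover.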



_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
