For a few years I've been using counts of inlinks from the "Wikipedia
Pagelinks" dataset as a subjective importance measure. I'm now
transitioning to something based on pageview statistics (page counts).
I say "subjective" because this is one of those things where there isn't
necessarily a right answer: something can be better or worse, but you
can't specify everything -- sentiment analysis and full-text retrieval are
similar domains. For instance, unless we were building a KB about
Binghamton NY, it's certainly true that the New York Yankees are more
important than the Binghamton Mets (a minor league baseball team). A system
that ranks that wrong is failing and ought to go back to the developers.
Although sports fans in the Northeast have strong opinions, you can't say
the system failed a test if it didn't rank the Red Sox and the Yankees the
way that you rank them. If you go to a sports bar in upstate NY wearing a
Red Sox cap you'll find there's no objective answer to this one.
If we count links or count page views we can say that more people
looked at team A or that team B has more links, but when we use these
scores we're often using them as a proxy for something else. For instance,
a named-entity disambiguator can do a better job if it knows that "Ithaca,
NY" gets talked about much more than "Ithaca, Nebraska". The function you
really want is the rate at which an entity gets talked about in a
stream of text -- you could get human judgements of that at great cost,
or you could assume it correlates well with inlink count or pageview
count.
When you construct a KB using this sort of method you'll always find that
you disagree with it somewhat. For instance, some guy named "Lil Wayne" is
one of the top 20 people in page views. I can say that underground hip hop
from NYC is about 1/3 of the soundtrack of my life, but I had to look this
guy up in Wikipedia to see who he is. Wikipedia says he's had more singles
on the Billboard Hot 100 than anybody else in history, but somehow he sold
a lot of records without my finding out.
Inlinks have other biases, and they probably aren't so representative of
what "people think" because a tiny fraction of people with otaku p.o.v.'s do
most of the writing for Wikipedia.
The page counts are here
http://dumps.wikimedia.org/other/pagecounts-raw/
and they are huge because the granularity is very fine. A new
file gets published hourly, and each is typically about 75 MB these days.
The files cover all the Wikimedia projects, so you get the different
language encyclopedias, the Wiktionaries, Wikimedia Commons, etc. The
files weren't so big five years ago, but there are a lot of them, so the
bulk of data is considerable.
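For concreteness, each line of an hourly file has four space-separated
fields: project code, page title, view count, and bytes transferred. Here's
a minimal sketch of pulling the English Wikipedia counts out of a batch of
those lines (the helper name en_counts is mine, not from any library):

```python
def en_counts(lines):
    """Aggregate English-Wikipedia view counts from pagecounts-raw lines.

    Each line looks like:  en Main_Page 42 1000
    (project, page title, views, bytes). Other projects use codes like
    "de" or "en.d" (German Wikipedia, English Wiktionary) and are skipped.
    """
    counts = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) != 4:
            continue  # skip malformed lines rather than crash
        project, title, views, _bytes = parts
        if project == "en":
            counts[title] = counts.get(title, 0) + int(views)
    return counts
```

In practice you'd feed this the decompressed lines of one hourly .gz file
(e.g. via gzip.open) rather than an in-memory list.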
If you had all the data (or a larger sample than I'm working with now)
you could plot the curve of interest in subject A over time, see what
people think about most around 04:00 GMT, what things trend up for
Halloween, etc. In some ways you can use the time-based data to substitute
for a more complex "reasoning" procedure.
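As a sketch of what that time-based analysis might look like -- the
function and data layout here are my own illustration, nothing the dumps
provide -- you can fold per-hour count dictionaries into a daily-cycle
profile per page and then ask which pages peak around a given UTC hour:

```python
from collections import defaultdict

def hourly_profile(hourly_counts):
    """Build a daily-cycle viewing profile.

    hourly_counts: iterable of (utc_hour, {title: views}) pairs, one per
    observed hourly file. Returns {title: [mean views at UTC hour 0..23]},
    treating hours with no observations as 0.0.
    """
    sums = defaultdict(lambda: [0] * 24)  # per-title view sums by hour
    n_obs = [0] * 24                      # how many files seen per hour slot
    for hour, counts in hourly_counts:
        n_obs[hour] += 1
        for title, views in counts.items():
            sums[title][hour] += views
    return {title: [s / n_obs[h] if n_obs[h] else 0.0
                    for h, s in enumerate(row)]
            for title, row in sums.items()}
```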
If you pick out random files over the last five years, you get results
that are not so "volatile" in the sense that they're not completely
dominated by the superhero movie that's in theatres right now. They're
still biased in the sense that somebody forgettable named "Kim Kardashian"
is on the top 20 people list (I've never seen her on TV but at least I've
met people who've met her at parties in Hollywood.)
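Picking the random sample of hourly files could be sketched like this --
the directory layout (year/year-month/pagecounts-YYYYMMDD-HH0000.gz) is my
reading of the dump site and worth double-checking before you script
against it:

```python
import random
from datetime import datetime, timedelta

def random_hour_files(n, start, end, seed=None):
    """Pick n random hourly timestamps in [start, end) and return the
    corresponding pagecounts-raw file paths (layout assumed, see above)."""
    rng = random.Random(seed)  # seedable for reproducible samples
    total_hours = int((end - start).total_seconds() // 3600)
    paths = []
    for _ in range(n):
        t = start + timedelta(hours=rng.randrange(total_hours))
        paths.append("%d/%d-%02d/pagecounts-%s0000.gz"
                     % (t.year, t.year, t.month, t.strftime("%Y%m%d-%H")))
    return paths
```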
Anyhow, the counts can be compressed down to one time-averaged number per
page, and I'll be publishing that data shortly.
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion