For a few years I've been using counts of inlinks from the "Wikipedia
Pagelinks" dataset as a subjective importance measure. I'm now
transitioning to something based on pageview statistics (page counts).
I say "subjective" because this is one of those things where there isn't
necessarily a right answer: something can be better or worse, but you
can't specify everything -- sentiment analysis and full-text retrieval are
similar domains. For instance, unless we were building a KB about
Binghamton NY, it's certainly true that the New York Yankees are more
important than the Binghamton Mets (a minor league baseball team). A system
that ranks that wrong is failing and ought to go back to the developers.
Although sports fans in the Northeast have strong opinions, you can't say
the system failed a test if it didn't rank the Red Sox and the Yankees the
way that you rank them. If you go to a sports bar in upstate NY wearing a
Red Sox cap you'll find there's no objective answer to this one.
If we count links or count page views we can say that more people
looked at team A or that team B has more links, but when we use these
scores we're often using them as a proxy for something else. For instance,
a named-entity disambiguator can do a better job if it knows that "Ithaca,
NY" gets talked about much more than "Ithaca, Nebraska". The function you
really want is the rate at which an entity gets talked about in a
stream of text -- you could get human judgements of that at great cost,
or you could assume it correlates well with inlink count or pageview
count.
When you construct a KB using this sort of method you'll always find that
you disagree with it somewhat. For instance, some guy named "Lil Wayne" is
one of the top 20 people in page views. I can say that underground hip hop
from NYC is about 1/3 of the soundtrack of my life, but I had to look this
guy up in Wikipedia to see who he is. Wikipedia says he's had more singles
on the Billboard Hot 100 than anybody else in history, but somehow he sold
a lot of records without my finding out.
Inlinks have other biases, and they probably aren't so representative of
what "people think" because a tiny fraction of people with otaku p.o.v.'s do
most of the writing for Wikipedia.
The page counts are here
http://dumps.wikimedia.org/other/pagecounts-raw/
and they are huge because the granularity is very fine. A new
file gets published hourly, and each is typically about 75 MB these days.
The files cover all the Wikimedia projects, so you get the different
language encyclopedias, the Wiktionaries, Wikimedia Commons, etc. The
files weren't so big five years ago, but there are a lot of them, so the
bulk of data is considerable.
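For concreteness, each line of an hourly file has four space-separated
fields: project code, page title, view count, and bytes transferred. Here's
a minimal sketch of pulling the English Wikipedia counts out of a batch of
those lines (the helper name en_counts is mine, not from any library):

```python
def en_counts(lines):
    """Aggregate English-Wikipedia view counts from pagecounts-raw lines.

    Each line looks like:  en Main_Page 42 1000
    (project, page title, views, bytes). Other projects use codes like
    "de" or "en.d" (German Wikipedia, English Wiktionary) and are skipped.
    """
    counts = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) != 4:
            continue  # skip malformed lines rather than crash
        project, title, views, _bytes = parts
        if project == "en":
            counts[title] = counts.get(title, 0) + int(views)
    return counts
```

In practice you'd feed this the decompressed lines of one hourly .gz file
(e.g. via gzip.open) rather than an in-memory list.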
If you had all the data (or a larger sample than I'm working with now)
you could plot the curve of interest in subject A over time, see what
people think about most around 04:00 GMT, what things trend up for
Halloween, etc. In some ways you can use the time-based data to substitute
for a more complex "reasoning" procedure.
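As a sketch of what that time-based analysis might look like -- the
function and data layout here are my own illustration, nothing the dumps
provide -- you can fold per-hour count dictionaries into a daily-cycle
profile per page and then ask which pages peak around a given UTC hour:

```python
from collections import defaultdict

def hourly_profile(hourly_counts):
    """Build a daily-cycle viewing profile.

    hourly_counts: iterable of (utc_hour, {title: views}) pairs, one per
    observed hourly file. Returns {title: [mean views at UTC hour 0..23]},
    treating hours with no observations as 0.0.
    """
    sums = defaultdict(lambda: [0] * 24)  # per-title view sums by hour
    n_obs = [0] * 24                      # how many files seen per hour slot
    for hour, counts in hourly_counts:
        n_obs[hour] += 1
        for title, views in counts.items():
            sums[title][hour] += views
    return {title: [s / n_obs[h] if n_obs[h] else 0.0
                    for h, s in enumerate(row)]
            for title, row in sums.items()}
```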
If you pick out random files over the last five years, you get results
that are not so "volatile" in the sense that they're not completely
dominated by the superhero movie that's in theatres right now. They're
still biased in the sense that somebody forgettable named "Kim Kardashian"
is on the top 20 people list (I've never seen her on TV but at least I've
met people who've met her at parties in Hollywood.)
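Picking the random sample of hourly files could be sketched like this --
the directory layout (year/year-month/pagecounts-YYYYMMDD-HH0000.gz) is my
reading of the dump site and worth double-checking before you script
against it:

```python
import random
from datetime import datetime, timedelta

def random_hour_files(n, start, end, seed=None):
    """Pick n random hourly timestamps in [start, end) and return the
    corresponding pagecounts-raw file paths (layout assumed, see above)."""
    rng = random.Random(seed)  # seedable for reproducible samples
    total_hours = int((end - start).total_seconds() // 3600)
    paths = []
    for _ in range(n):
        t = start + timedelta(hours=rng.randrange(total_hours))
        paths.append("%d/%d-%02d/pagecounts-%s0000.gz"
                     % (t.year, t.year, t.month, t.strftime("%Y%m%d-%H")))
    return paths
```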
Anyhow, the counts can be compressed down to one time-averaged number per
page, and I'll be publishing that data shortly.
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion