Let me share a bit of what I know about the page counts, because I’ve been
evaluating them as a subjective importance score too.
It’s about 2 TB of data, and I’ve been working over a slow connection, so I
need to work with samples of this data rather than the whole thing.
I tried sampling a week’s worth of data and the results were a disaster: in
the first week of August, Michael Phelps was the most important person in the
world. Maybe that was true that week, but it’s not a good answer for a score
that’s meant to be valid for all time. It’s clear that the “prior distribution
of concepts” that people look up in Wikipedia is highly time-dependent, and
that’s probably also true for other prior distributions.
So I grabbed about 50 GB of randomly sampled data and got results that are
better but still have a strong recency bias. If you look at some movie, like
“The Avengers”, you see that interest picks up when people read the hype,
stays high while the movie is in theaters, and then falls off to some
long-term trend. With 5 years of data we can see the peak for The Avengers,
but we don’t see the peak for the 1977 Star Wars movie, so there’s an unfair
bias towards The Avengers and against Star Wars.
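For what it’s worth, the random sample can be drawn at the level of the
hourly dump files. Here is a minimal sketch, assuming a local text file that
lists one dump-file URL per line (the file name and the 50 GB target are
placeholders):

    import os
    import random
    import urllib.request

    # Draw hourly pagecount files uniformly at random across the whole
    # history, instead of taking one contiguous week.
    urls = [u.strip() for u in open("pagecount_urls.txt") if u.strip()]
    random.shuffle(urls)

    fetched = 0
    target = 50 * 1024 ** 3   # stop near 50 GB; sizes vary, so approximate
    for url in urls:
        if fetched >= target:
            break
        name = os.path.basename(url)
        urllib.request.urlretrieve(url, name)
        fetched += os.path.getsize(name)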
When you look at a bunch of things associated with the same time (“writers
who were active in 1920”) the recency bias will be less obnoxious. It
probably still hurts results, though, in much the same way that variable
document lengths caused so much trouble in TREC until Okapi came along with
length normalization.
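One crude fix in that spirit (my own illustration, not something I’ve
actually run) would be to score a page by something robust like its median
monthly view rate instead of the raw sum, so that a one-time spike doesn’t
dominate:

    import statistics

    # Damp one-time spikes (a movie release, an Olympics run) by scoring a
    # page on its median monthly views rather than the total, loosely
    # analogous to how Okapi damps raw term frequency.
    def spike_resistant_score(monthly_views):
        return statistics.median(monthly_views)

    print(spike_resistant_score([40, 55, 3000, 60, 50]))  # 55; spike ignored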
Anyhow, I am processing this data on a Hadoop cluster in my house; as I get
more data into my sample I can ask more specific questions and get more
accurate answers. The data could also be moved into the AWS cloud and
processed with Elastic MapReduce... in my case this is just a matter of
pointing Pig at a different server. I wrote to the people who run the public
data sets in AWS
http://aws.amazon.com/publicdatasets/
about the possibility of getting the page counts hosted in AWS, but I haven’t
heard back from them yet. If more people ask, perhaps they’ll take some
action on this.
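For anyone curious what the job actually does, the heart of it is just a
grouped sum over the hourly pagecount lines. Here is roughly what the Pig
script amounts to, sketched in Python for a single (decompressed) hourly
file; the file name is just an example of the hourly naming scheme:

    from collections import defaultdict

    # Each line of an hourly pagecounts file looks like
    #   "<project> <page> <views> <bytes>"
    # Sum the views per page for the English Wikipedia.
    totals = defaultdict(int)
    with open("pagecounts-20120801-000000") as f:
        for line in f:
            parts = line.split(" ")
            if len(parts) == 4 and parts[0] == "en":
                totals[parts[1]] += int(parts[2])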
I’ve done some other work that uses link-based importance scores, and there
are obviously different things you could try there, but it was hard to make a
real project out of it because I didn’t have an evaluation set. If you try to
predict the popularity-based scores from a link-based score, that would at
least give some basis for saying one kind of link-based score is better than
another. I’d do it, but I have a lot of other projects that are more urgent,
if less interesting.
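If somebody wanted to try it, the comparison itself is cheap. A sketch with
made-up numbers (the scores here are toy values, not from my data):

    from scipy.stats import spearmanr

    # Treat the popularity-based scores as the evaluation set and ask how
    # well a candidate link-based score predicts their ranking; a higher
    # rank correlation means a better link-based score.
    popularity = [0.00003287, 0.00009100, 0.00000040, 0.00000750]
    link_score = [0.00002900, 0.00004100, 0.00000880, 0.00000100]

    rho, _ = spearmanr(popularity, link_score)
    print("Spearman rho:", rho)  # 0.8 for these toy values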
Anyway, if anybody writes back I could make a Turtle file with time-averaged
popularity-based importance scores, e.g.
:Larry_King :importance 0.00003287 .
This would be about 4 million triples. It would be great if I could get these
hosted at the DBpedia site; otherwise I could host it on my site.
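Generating the file is trivial once the scores exist. A sketch (the default
prefix and the :importance predicate are placeholders, not settled
vocabulary):

    # page -> time-averaged score; the one entry is the example from above
    scores = {"Larry_King": 0.00003287}

    with open("importance.ttl", "w") as out:
        out.write("@prefix : <http://dbpedia.org/resource/> .\n\n")
        for page, score in sorted(scores.items()):
            out.write(":%s :importance %.8f .\n" % (page, score))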