Let me share a bit of what I know about the page counts, because I’ve been 
evaluating them as a subjective importance score too.

    It’s about 2 TB of data, and I’ve been working over a slow connection, so 
I need to work with samples of this data, not the whole thing.

    I tried sampling a week’s worth of data and the results were a disaster. 
In the first week of August, Michael Phelps was the most important person in 
the world. Maybe that was true, but it’s not a good answer for a score that’s 
supposed to be valid for all time. It’s clear that the “prior distribution of 
concepts” that people look up in Wikipedia is highly time-dependent, and 
that’s probably also true of other prior distributions.

    So I grabbed a random sample of about 50 GB and got results that are 
better, but still have a strong recency bias. If you look at some movie, like 
“The Avengers”, you see that interest in the movie picks up when people read 
the hype, is high while the movie is in theaters, and then falls off to some 
long-term trend. With 5 years of data we can see the peak of The Avengers, 
but we don’t see the peak of the 1977 Star Wars movie, so there’s an unfair 
bias towards The Avengers and against Star Wars.
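
(For concreteness, here’s roughly what that sampling step looks like in Pig, 
which I’ll come back to below. This is a sketch, not my exact script: the 
paths are made up, 0.025 is just 50 GB / 2 TB, and the four columns are the 
standard pagecounts format.)

    -- Hourly pagecount lines are space-separated: project, title, views, bytes
    raw = LOAD 'pagecounts/*' USING PigStorage(' ')
          AS (project:chararray, title:chararray, views:long, bytes:long);

    -- Keep a ~2.5% uniform random sample (about 50 GB out of 2 TB)
    sampled = SAMPLE raw 0.025;

    STORE sampled INTO 'pagecounts-sample' USING PigStorage(' ');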

    When you look at a bunch of things associated with the same time (“writers 
who were active in 1920”), the recency bias will be less obnoxious. It 
probably still hurts results, in much the same way that variable document 
lengths caused so much trouble in TREC until Okapi’s length normalization got 
invented.
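
(For reference, the Okapi BM25 fix was an explicit length-normalization term 
in the scoring formula; a time-window normalization for page views would need 
something analogous. In the usual notation, with f(t,d) the term frequency, 
|d| the document length, avgdl the average document length, and k_1, b tuning 
constants:)

    score(q,d) = \sum_{t \in q} IDF(t) \cdot
                 \frac{f(t,d)\,(k_1 + 1)}{f(t,d) + k_1\,(1 - b + b\,|d|/avgdl)}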

    Anyhow, I am processing this data on a Hadoop cluster in my house; as I 
get more data into my sample I can ask more specific questions and get more 
accurate answers. The data could also be moved into the AWS cloud and 
processed with Elastic MapReduce... In my case this is just a matter of 
pointing Pig at a different server. I wrote to the people who run the public 
data sets in AWS

http://aws.amazon.com/publicdatasets/

about the possibility of getting the page counts hosted in AWS, but I haven’t 
heard back from them yet. If more people ask, perhaps they’ll take some 
action on this.
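
(If the page counts did land there, the “pointing Pig at a different server” 
step really is just the LOAD path; the bucket name below is hypothetical, and 
depending on the Hadoop version the scheme is s3:// or s3n://.)

    -- On my home cluster, reading from HDFS:
    raw = LOAD 'hdfs:///pagecounts/*' USING PigStorage(' ')
          AS (project:chararray, title:chararray, views:long, bytes:long);

    -- The same script on Elastic MapReduce, reading a public data set bucket:
    raw = LOAD 's3n://wikipedia-pagecounts/*' USING PigStorage(' ')
          AS (project:chararray, title:chararray, views:long, bytes:long);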

I’ve done some other work that uses link-based importance scores, and there 
are obviously different things you could try there, but it was hard to make a 
real project out of it because I didn’t have an evaluation set. If you try to 
predict the popularity-based scores from the link-based scores, that would at 
least give some basis for saying one kind of link-based score is better than 
another. I’d do it, but I have a lot of other projects that are more urgent, 
if less interesting.
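
(A sketch of how that evaluation could work in Pig: join the two score tables 
on title and compute a Pearson correlation; the file names and schemas are 
made up. A link-based score that correlates better with observed popularity 
wins.)

    pop   = LOAD 'popularity.tsv' USING PigStorage('\t')
            AS (title:chararray, x:double);
    links = LOAD 'linkscore.tsv' USING PigStorage('\t')
            AS (title:chararray, y:double);

    -- One row per title with both scores, plus the products we need
    j     = JOIN pop BY title, links BY title;
    pairs = FOREACH j GENERATE pop::x AS x, links::y AS y,
                               pop::x * links::y AS xy,
                               pop::x * pop::x AS xx,
                               links::y * links::y AS yy;

    -- Pearson r from the usual sums
    g    = GROUP pairs ALL;
    sums = FOREACH g GENERATE COUNT(pairs) AS n,
                              SUM(pairs.x) AS sx, SUM(pairs.y) AS sy,
                              SUM(pairs.xy) AS sxy,
                              SUM(pairs.xx) AS sxx, SUM(pairs.yy) AS syy;
    r    = FOREACH sums GENERATE (n * sxy - sx * sy) /
               SQRT((n * sxx - sx * sx) * (n * syy - sy * sy));

    DUMP r;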

Anyway, if anybody writes back, I could make a Turtle file with time-averaged 
popularity-based importance scores, like

:Larry_King :importance 0.00003287 .

This would be about 4 million triples. It would be great if I could get these 
hosted at the DBpedia site; otherwise I could host them on my own site.
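
(The last step would be something like this; a sketch that assumes the 
sampled rows from above, with a made-up output path. The scores are mean 
views per title, normalized so they sum to 1; Pig’s scalar projection handles 
the division by the one-row total.)

    raw     = LOAD 'pagecounts-sample' USING PigStorage(' ')
              AS (project:chararray, title:chararray, views:long, bytes:long);

    -- Time-averaged views per title over the whole sample
    grouped = GROUP raw BY title;
    avged   = FOREACH grouped GENERATE group AS title,
                                       AVG(raw.views) AS mean_views;

    -- Normalize by the grand total so the scores sum to 1
    alltog  = GROUP avged ALL;
    total   = FOREACH alltog GENERATE SUM(avged.mean_views) AS t;
    scored  = FOREACH avged GENERATE CONCAT(':', title), ':importance',
                                     mean_views / (double) total.t, '.';

    -- Space-separated fields give one Turtle triple per line, like the
    -- :Larry_King example above (modulo number formatting and escaping
    -- of odd titles)
    STORE scored INTO 'importance.ttl' USING PigStorage(' ');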



