On 10/10/12 5:31 PM, [email protected] wrote:

Let me share a bit of what I know about the page counts, because I've been evaluating them as the basis for a subjective importance score too.

It's about 2 TB of data, and I've been working over a slow connection, so I have to work with samples of the data rather than the whole thing.

I tried sampling a week's worth of data and the results were a disaster. In the first week of August, Michael Phelps was the most important person in the world. Maybe that was true, but it's not a good answer for a score that's supposed to be valid for all time. It's clear that the "prior distribution of concepts" people look up in Wikipedia is highly time-dependent, and that's probably also true for other prior distributions.

So I grabbed about 50 GB of randomly sampled data and got results that are better but still have a strong recency bias. If you look at some movie, like "The Avengers", you see that interest picks up when people read the hype, stays high while the movie is in theaters, and then falls off to some long-term trend. With five years of data we can see the peak of The Avengers, but we can't see the peak of the 1977 Star Wars movie, so there's an unfair bias towards The Avengers and against Star Wars.
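
For what it's worth, the sampling itself is simple: pick a random subset of the hourly dump files, spread over the whole time span instead of a contiguous stretch. A sketch of that (the base URL pattern, date range, and sampling rate here are only illustrative):

# Sketch: choose a random ~1% of the hourly pagecount files, spread over
# the full span rather than one contiguous block. The URL / file-name
# pattern is from memory and only illustrative.
import random
from datetime import datetime, timedelta

BASE = "http://dumps.wikimedia.org/other/pagecounts-raw"
START, END = datetime(2008, 1, 1), datetime(2012, 10, 1)
SAMPLE_RATE = 0.01

def hourly_urls(start, end):
    t = start
    while t < end:
        yield "%s/%d/%d-%02d/pagecounts-%s-%02d0000.gz" % (
            BASE, t.year, t.year, t.month, t.strftime("%Y%m%d"), t.hour)
        t += timedelta(hours=1)

random.seed(42)
sample = [u for u in hourly_urls(START, END) if random.random() < SAMPLE_RATE]
print(len(sample), "files to fetch")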

When you look at a bunch of things associated with the same time ("writers who were active in 1920"), the recency bias will be less obnoxious. It probably still hurts results, in much the same way that variable document lengths caused so much trouble in TREC until Okapi came along.

Anyhow, I am processing this data on a Hadoop cluster in my house; as I get more data into my sample, I can ask more specific questions and get more accurate answers. The data could also be moved into the AWS cloud and processed with Elastic MapReduce; in my case that's just a matter of pointing Pig at a different server (a rough sketch of the aggregation is below). I wrote to the people who run the public data sets in AWS

http://aws.amazon.com/publicdatasets/

about the possibility of getting the page counts hosted in AWS, but I haven't heard back from them yet. If more people ask, perhaps they'll take some action on this.
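
In case it helps anyone reproduce this, the core of the job is just summing hourly view counts per article over whatever sample you have. Here's a rough Python equivalent of what the Pig script does (the paths are placeholders, and I'm assuming the raw dumps' "project title views bytes" line format):

# Rough Python equivalent of the Pig aggregation: sum hourly view counts
# per English-Wikipedia article across the sampled pagecount files.
# File locations are placeholders.
import glob
import gzip
from collections import Counter

totals = Counter()
for path in glob.glob("sample/pagecounts-*.gz"):
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if len(parts) != 4 or parts[0] != "en":
                continue  # keep only English Wikipedia traffic
            try:
                totals[parts[1]] += int(parts[2])
            except ValueError:
                continue  # skip the occasional malformed line

print(totals.most_common(10))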

I've done some other work that uses link-based importance scores, and there are obviously different things you could try there, but it was hard to make a real project out of it because I didn't have an evaluation set. If you try to predict the popularity-based scores from a link-based score, that would at least give some basis for saying one kind of link-based score is better than another (a sketch of that comparison is below). I'd do it, but I have a lot of other projects that are more urgent, if less interesting.
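
To be concrete about what that evaluation could look like: treat the popularity-based scores as the target and see which link-based score ranks articles most similarly, e.g. with a Spearman rank correlation over their common articles. A sketch (both inputs are just dicts from article title to score; ties are ignored, which is fine for a rough comparison):

# Sketch: rank-correlate a link-based score against the popularity-based one.
def spearman(link_scores, pop_scores):
    """Spearman rank correlation over common articles (no tie handling)."""
    common = sorted(set(link_scores) & set(pop_scores))
    n = len(common)
    if n < 2:
        return 0.0
    def ranks(scores):
        order = sorted(common, key=lambda t: scores[t], reverse=True)
        return {t: r for r, t in enumerate(order)}
    rl, rp = ranks(link_scores), ranks(pop_scores)
    d2 = sum((rl[t] - rp[t]) ** 2 for t in common)
    return 1 - 6.0 * d2 / (n * (n * n - 1))

# e.g. spearman(inlink_counts, popularity_scores) vs.
#      spearman(pagerank_scores, popularity_scores)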

Anyway, if anybody writes back, I could make a Turtle file with time-averaged popularity-based importance scores, along the lines of

:Larry_King :importance 0.00003287 .

This would be about 4 million triples. It would be great if I could get these hosted at the DBpedia site; otherwise I could host them on my own site.
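
If the score is simply each article's share of all sampled page views (which puts the numbers in roughly the range of the example above), writing the file is trivial. A sketch, with placeholder prefix and predicate URIs:

# Sketch: turn per-article view totals into time-averaged importance scores
# (an article's share of all sampled views) and serialize them as Turtle.
# The @prefix and predicate are placeholders, and real page titles would
# need proper IRI escaping.
def write_turtle(totals, path="importance.ttl"):
    if not totals:
        return
    grand_total = float(sum(totals.values()))
    with open(path, "w", encoding="utf-8") as out:
        out.write("@prefix : <http://dbpedia.org/resource/> .\n\n")
        for title, views in sorted(totals.items()):
            out.write(":%s :importance %.8f .\n" % (title, views / grand_total))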


Publish your dataset and it will be added to the LOD Cloud cache we maintain. Ditto the DBpedia instance, if it's all about cross-references etc.

If you need an S3 bucket, give me a sense of size and I'll see what we can offer.


Kingsley





--

Regards,

Kingsley Idehen 
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca handle: @kidehen
Google+ Profile: https://plus.google.com/112399767740508618350/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen




_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
