Nice assignment! Just curious whether re-tweets were removed from the data. I'm not sure how much that will affect the results, but it does seem to me that tweets from popular tweeters are often re-tweeted by other users. If those re-tweets make up a significant portion of the unpopular tweeters' tweets and are not removed, the two top-terms lists will end up looking more similar than they should.
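To make the concern concrete, here is a minimal Python sketch of the counting step with an optional re-tweet filter. It is not the assignment's Hadoop code, and several things in it are my own assumptions: the record layout `(text, followers, following)`, the function names `is_retweet` and `top_terms`, detecting re-tweets by a leading "RT @user" (which misses other re-tweet styles), tokenizing on runs of letters, and skipping users whose follower and following counts are equal.

```python
import re
from collections import Counter

# Assumption: a re-tweet is any tweet that starts with "RT @user".
RT_PATTERN = re.compile(r"^\s*RT\s+@\w+", re.IGNORECASE)
# Assumption: a "term" is a run of ASCII letters.
TERM_PATTERN = re.compile(r"[a-z]+")


def is_retweet(text):
    """Heuristic check for the manual 'RT @user ...' convention."""
    return RT_PATTERN.match(text) is not None


def top_terms(tweets, n=10, drop_retweets=True, min_len=6):
    """Return (popular_top, unpopular_top) lists of (term, count) pairs.

    tweets is an iterable of (text, followers, following) tuples,
    a hypothetical layout chosen for this sketch.
    """
    popular, unpopular = Counter(), Counter()
    for text, followers, following in tweets:
        if drop_retweets and is_retweet(text):
            continue
        if followers > following:
            target = popular      # more followers than they follow
        elif following > followers:
            target = unpopular    # follow more people than follow them
        else:
            continue              # ties are skipped (my assumption)
        for term in TERM_PATTERN.findall(text.lower()):
            if len(term) >= min_len:
                target[term] += 1
    return popular.most_common(n), unpopular.most_common(n)
```

Running `top_terms` twice on the same data, once with `drop_retweets=True` and once with `drop_retweets=False`, and comparing the overlap between the two lists in each case would give a rough idea of how much the re-tweets matter.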
I am guessing there are far fewer popular tweeters than unpopular ones. That will (I think) most likely translate into the top-terms list for popular tweeters being topically less diverse than that of the unpopular users. How to evaluate this hypothesis doesn't seem very obvious. Perhaps word clustering using WordNet senses is one way to start -- but Twitter vocabulary will certainly challenge WordNet's coverage.

Thanks,
-Mahesh

________________________________
From: Ted Pedersen <duluth...@gmail.com>
To: nlpatumd@yahoogroups.com
Sent: Sunday, April 10, 2011 11:12 AM
Subject: Re: [nlpatumd] a twitter puzzle

In my computer architecture class this semester (Spring 2011) we've been focusing on using Hadoop on a big cluster down at the Minnesota SuperComputing Institute. We recently did an assignment based on Twitter data, where I collected about 100 million tweets sent between March 8 - 31. I posed the assignment more or less in these terms...

There are popular people and unpopular people on Twitter. The popular ones have more followers than they follow. Unpopular people follow more people than follow them. Find out which are the most popular terms as used by popular and unpopular people, and we'll compare those lists to see what makes a tweeter popular or not... Terms were selected based on frequency, and had to be 6 or more characters long.

My question to you is simple. Would you expect any differences in the most frequent terms used by popular and unpopular people? If so, what might they be?

I'll report back in a few days.

Enjoy,
Ted

PS Here's the formal assignment statement...

https://sites.google.com/site/duluthted/cs-5621-computer-architecture---spring-2011/programming-assignment-4---due-friday-april-8-by-5pm-to-the-webdrop

--
Ted Pedersen
http://www.d.umn.edu/~tpederse