2010/1/20 Ian Holsman <[email protected]>:
> On 1/20/10 2:35 AM, Jason Rutherglen wrote:
>>
>> We've got Newsgroup classification. I'm kinda of interested in
>> creating a Twitter classification system, or at least playing
>> around with it. Also I think as a relevant growing large data
>> set, it seems Twitter fit well with Hadoop based machine
>> learning algorithms... Just throwing out into the wild!
>>
>>
>
> Hi Jason.
> I think the biggest issues here are twofold.
>
> 1. access to the data, although I'm sure the ASF could work something out
> here

Firehose (the complete live Twitter stream) is going to be opened to the
public this year. In the meantime, it is possible to gain access to a
sample stream and to run ad hoc search queries on specific terms or
user profiles.

> 2. training data. wouldn't you need a set of 'tweets' classified in some
> manner? or were you thinking of using a different data source to base it on?

I see two obvious sources of labels in the Twitter data:

 - #hashtags placed by the users themselves (the 1000 or so most popular
hashtags should be used consistently enough to extract signal from noise)
 - Twitter lists, which label users and, by transitivity, the typical
content of their tweets. Again, the most frequently recurring list names
should carry a reasonably universal meaning.
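
To make the hashtag idea a bit more concrete, here is a rough sketch in
plain Python (not Mahout code). The function name, the top_k parameter
and the sample tweets are all made up for illustration; it only assumes
the tweets are available as a list of strings.

```python
import re
from collections import Counter

# Hypothetical helper: a hashtag is "#" followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def hashtag_labels(tweets, top_k=1000):
    """Build (text, label) pairs, using the top_k most frequent hashtags as labels."""
    # Lowercase the tags so "#Hadoop" and "#hadoop" count as the same label.
    tags_per_tweet = [
        [tag.lower() for tag in HASHTAG_RE.findall(tweet)] for tweet in tweets
    ]
    counts = Counter(tag for tags in tags_per_tweet for tag in tags)
    popular = {tag for tag, _ in counts.most_common(top_k)}

    examples = []
    for tweet, tags in zip(tweets, tags_per_tweet):
        # Strip the hashtags from the text so the label does not leak into the features.
        text = " ".join(HASHTAG_RE.sub("", tweet).split())
        for tag in sorted(set(tags) & popular):
            examples.append((text, tag))
    return examples


sample = [
    "Loving the new #Hadoop release",
    "#hadoop cluster down again",
    "lunch time #food",
]
print(hashtag_labels(sample, top_k=1))
```

Tweets without a popular hashtag are simply dropped from the training
set; whether that is acceptable depends on how much data is left.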

There is also the location data of the authors, in case you want to
learn a model of the sentiment of discussions by country, for
instance.

-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name
