Re: Feedback on "big data" analysis talk

Miles Osborne Mon, 16 Nov 2009 14:03:17 -0800

See this paper:

http://www.cs.umass.edu/~mimno/papers/fast-topic-model.pdf

It covers efficient LDA over streams, which looks like what you need here.

Miles

2009/11/16 Grant Ingersoll <[email protected]>:
>
> On Nov 16, 2009, at 2:14 PM, Ken Krugler wrote:
>
>> Hi all,
>>
>> I'm going to be giving a talk at the Bay Area ACM data mining SIG in 
>> December, and I need to finalize my topic today :)
>>
>> I was going to expand on my "Web mining for SEO keywords" talk from the ACM 
>> unconference a few weeks back.
>>
>> But since I'll have data from tens to hundreds of millions of web pages to 
>> work with, from my public terabyte dataset crawl, I'd like to apply Mahout 
>> to it.
>>
>> So I'm going to toss out one idea...
>>
>> - I'd like to automatically generate a timeline of events.
>>
>> - I can extract dates and 2-to-4 word "terms" from web pages.
>>
>> - Could I use LDA to create clusters of common terms for dates?
>
> I recall David Hall saying something about using LDA for Topics over time 
> which would be interesting too.  Last we talked, he said he had an 
> implementation as well, but that was this summer and I haven't seen a patch.
>
>>
>> I don't want to get into named entity extraction, or anything that involves 
>> more than simple data extraction and then the application of a scalable 
>> algorithm currently supported by Mahout.
>>
>> Looking for feedback on the above, in terms of feasibility, level of effort, 
>> interest from others in helping, etc.
>>
>> Or if somebody else has a suggestion for something simple that could provide 
>> interesting/obvious results, I'm all ears.
>

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
