see this paper: http://www.cs.umass.edu/~mimno/papers/fast-topic-model.pdf
--efficient LDA over streams. looks like what you need here

Miles

2009/11/16 Grant Ingersoll <[email protected]>:
>
> On Nov 16, 2009, at 2:14 PM, Ken Krugler wrote:
>
>> Hi all,
>>
>> I'm going to be giving a talk at the Bay Area ACM data mining SIG in
>> December, and I need to finalize my topic today :)
>>
>> I was going to expand on my "Web mining for SEO keywords" talk from the ACM
>> unconference a few weeks back.
>>
>> But the fact that I'll have 10s to 100s of millions of web page data to work
>> with, from my public terabyte dataset crawl, makes me want to apply Mahout
>> to the data.
>>
>> So I'm going to toss out one idea...
>>
>> - I'd like to automatically generate a timeline of events.
>>
>> - I can extract dates and 2-to-4 word "terms" from web pages.
>>
>> - Could I use LDA to create clusters of common terms for dates?
>
> I recall David Hall saying something about using LDA for Topics over time
> which would be interesting too. Last we talked, he said he had an
> implementation as well, but that was this summer and I haven't seen a patch.
>
>>
>> I don't want to get into named entity extraction, or anything that involves
>> more than simple data extraction and then the application of a scalable
>> algorithm currently supported by Mahout.
>>
>> Looking for feedback on the above, in terms of feasibility, level of effort,
>> interest by others to help, etc.
>>
>> Or if somebody else has a suggestion for something simple that could provide
>> interesting/obvious results, I'm all ears.
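
For what it's worth, the "clusters of common terms for dates" idea from the quoted message could start with a grouping step like the one below. This is only a rough sketch under my own assumptions, not anything from Mahout or the linked paper: the function name `build_date_documents`, the `(date, term)` input format, and the sample data are all hypothetical. It treats each date as one pseudo-document whose "words" are the extracted 2-to-4 word terms, which is the usual bag-of-words shape a topic model takes as input.

```python
from collections import Counter, defaultdict


def build_date_documents(pairs, min_words=2, max_words=4):
    """Group (date, term) pairs into one bag of terms per date.

    Hypothetical helper: keeps only terms of min_words..max_words words,
    normalizing case and whitespace so repeated terms are counted together.
    """
    docs = defaultdict(Counter)
    for date, term in pairs:
        normalized = " ".join(term.lower().split())
        n_words = len(normalized.split())
        if min_words <= n_words <= max_words:
            docs[date][normalized] += 1
    return dict(docs)


# Made-up sample input, standing in for terms extracted from crawled pages.
sample_pairs = [
    ("2009-11-16", "Bay Area ACM"),
    ("2009-11-16", "data mining SIG"),
    ("2009-11-16", "bay area acm"),
    ("2009-12-01", "public terabyte dataset"),
    ("2009-12-01", "Mahout"),  # dropped: only one word
]

date_docs = build_date_documents(sample_pairs)
for date in sorted(date_docs):
    print(date, dict(date_docs[date]))
```

Each per-date `Counter` would then be converted to whatever vector format the chosen LDA implementation expects; that conversion is deliberately left out here because it depends on the tool.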
