On Nov 16, 2009, at 2:14 PM, Ken Krugler wrote:

> Hi all,
> 
> I'm going to be giving a talk at the Bay Area ACM data mining SIG in 
> December, and I need to finalize my topic today :)
> 
> I was going to expand on my "Web mining for SEO keywords" talk from the ACM 
> unconference a few weeks back.
> 
> But the fact that I'll have data from tens to hundreds of millions of web 
> pages to work with, from my public terabyte dataset crawl, makes me want to 
> apply Mahout to the data.
> 
> So I'm going to toss out one idea...
> 
> - I'd like to automatically generate a timeline of events.
> 
> - I can extract dates and 2-to-4 word "terms" from web pages.
> 
> - Could I use LDA to create clusters of common terms for dates?
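
A rough sketch of the extraction step quoted above, in plain Python rather than anything Mahout-specific. It pulls dates and 2-to-4 word terms out of page text and builds one bag of terms per date, so each date can be treated as a "document" for LDA. The function names are hypothetical, dates are assumed to be in ISO form (YYYY-MM-DD), and tokenization is naive; the per-date bags would still need converting to whatever vector format the LDA job expects.

```python
import re
from collections import Counter, defaultdict

# Assumption: dates appear in ISO form (YYYY-MM-DD); real pages need a broader parser.
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")
# Assumption: a "word" is a run of lowercase letters, digits, or apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


def extract_terms(text, min_n=2, max_n=4):
    """Return every 2-to-4 word n-gram in the text, lowercased."""
    words = WORD_RE.findall(text.lower())
    terms = []
    for n in range(min_n, max_n + 1):
        for i in range(len(words) - n + 1):
            terms.append(" ".join(words[i:i + n]))
    return terms


def terms_by_date(pages):
    """Build one bag of terms per date found in the given page texts."""
    bags = defaultdict(Counter)
    for text in pages:
        dates = set(DATE_RE.findall(text))
        # Blank out the date strings so they do not leak into the terms.
        body = DATE_RE.sub(" ", text)
        terms = extract_terms(body)
        for date in dates:
            bags[date].update(terms)
    return dict(bags)
```

Here every term on a page is credited to every date on that page, which is the crudest possible association; restricting terms to a window around each date would likely give cleaner clusters.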

I recall David Hall saying something about using LDA for Topics over time, 
which would be interesting too.  When we last talked, he said he had an 
implementation as well, but that was this summer and I haven't seen a patch.

> 
> I don't want to get into named entity extraction, or anything that involves 
> more than simple data extraction and then the application of a scalable 
> algorithm currently supported by Mahout.
> 
> I'm looking for feedback on the above, in terms of feasibility, level of 
> effort, interest from others in helping, etc.
> 
> Or if somebody else has a suggestion for something simple that could provide 
> interesting/obvious results, I'm all ears.




