Hi all,
I'm going to be giving a talk at the Bay Area ACM data mining SIG in
December, and I need to finalize my topic today :)
I was going to expand on my "Web mining for SEO keywords" talk from
the ACM unconference a few weeks back.
But since I'll have tens to hundreds of millions of web pages to
work with, from my public terabyte dataset crawl, I'd like to apply
Mahout to the data.
So I'm going to toss out one idea...
- I'd like to automatically generate a timeline of events.
- I can extract dates and 2-to-4 word "terms" from web pages.
- Could I use LDA to create clusters of common terms for dates?
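
A minimal sketch of the extraction step, to make the idea concrete. It is
only an illustration under stated assumptions: dates are recognized only in
ISO form (YYYY-MM-DD), terms are plain 2-to-4 word n-grams, and the names
extract_dates, extract_terms and build_date_documents are hypothetical, not
part of Mahout or of the crawl code. Each date ends up with a bag of term
counts, which is the kind of per-"document" input an LDA job would expect.

```python
import re
from collections import Counter, defaultdict

# Assumption: only ISO-style dates (YYYY-MM-DD) are recognized.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
WORD_RE = re.compile(r"[a-z0-9']+")


def extract_dates(text):
    """Return the set of ISO-style date strings found in the text."""
    return {m.group(0) for m in DATE_RE.finditer(text)}


def extract_terms(text, min_n=2, max_n=4):
    """Return all word n-grams with min_n..max_n words, lowercased."""
    # Blank out dates first so their digits do not end up inside terms.
    # Caveat: n-grams may span sentence boundaries and removed dates.
    words = WORD_RE.findall(DATE_RE.sub(" ", text).lower())
    terms = []
    for n in range(min_n, max_n + 1):
        for i in range(len(words) - n + 1):
            terms.append(" ".join(words[i:i + n]))
    return terms


def build_date_documents(pages):
    """Group term counts by date: one bag of terms per date."""
    docs = defaultdict(Counter)
    for text in pages:
        dates = extract_dates(text)
        if not dates:
            continue  # pages without an associated date are skipped
        terms = extract_terms(text)
        for d in dates:
            docs[d].update(terms)
    return dict(docs)
```

With this shape, each date acts as one "document" and the n-grams act as its
"words"; whether LDA topics over such documents line up with real events is
exactly the open question.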
I don't want to get into named entity extraction, or anything that
involves more than simple data extraction and then the application of
a scalable algorithm currently supported by Mahout.
I'm looking for feedback on the above, in terms of feasibility, level
of effort, interest from others in helping, etc.
Or if somebody else has a suggestion for something simple that could
provide interesting/obvious results, I'm all ears.
Thanks!
-- Ken
PS - Some additional thoughts on the problem...
* I would skip any of my multi-word terms that start/end with
stopwords.
* I would skip any pages that don't have associated dates.
* I wouldn't worry about trying to figure out which date "the
previous Tuesday" referred to.
* I could skip terms that have too many different associated dates
- though it would be important to handle terms whose dates cluster
(are close to one another).
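
The filtering rules above could be sketched roughly as follows. This is a
hedged illustration only: the stopword list is a tiny placeholder, the
thresholds (max_dates=3, max_span_days=7) are arbitrary assumptions, and
has_stopword_edge, dates_cluster and keep_term are hypothetical helper names.

```python
from datetime import date

# Assumption: a tiny illustrative stopword list; a real one would be larger.
STOPWORDS = frozenset({"a", "an", "the", "of", "in", "on", "and", "or", "to", "for"})


def has_stopword_edge(term):
    """True if the multi-word term starts or ends with a stopword."""
    words = term.split()
    return words[0] in STOPWORDS or words[-1] in STOPWORDS


def dates_cluster(dates, max_span_days=7):
    """True if all dates fall within max_span_days of one another."""
    ordered = sorted(dates)
    return (ordered[-1] - ordered[0]).days <= max_span_days


def keep_term(term, dates, max_dates=3, max_span_days=7):
    """Apply the skip rules to one term and its associated dates."""
    if has_stopword_edge(term):
        return False
    distinct = set(dates)
    if not distinct:
        return False  # no associated date, nothing to place on a timeline
    if len(distinct) <= max_dates:
        return True
    # Many distinct dates: keep the term only if they are close together.
    return dates_cluster(distinct, max_span_days)
```

For example, a term seen on four consecutive days would be kept, while one
seen on four dates spread across a year would be dropped.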
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g