Feedback on "big data" analysis talk

Ken Krugler Mon, 16 Nov 2009 11:15:43 -0800

Hi all,

I'm going to be giving a talk at the Bay Area ACM data mining SIG in December, and I need to finalize my topic today :)

I was going to expand on my "Web mining for SEO keywords" talk from the ACM unconference a few weeks back.

But the fact that I'll have data from 10s to 100s of millions of web pages to work with, from my public terabyte dataset crawl, makes me want to apply Mahout to the data.


So I'm going to toss out one idea...

 - I'd like to automatically generate a timeline of events.

 - I can extract dates and 2-to-4 word "terms" from web pages.

 - Could I use LDA to create clusters of common terms for dates?
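
A minimal sketch of the extraction step in Python, assuming pages arrive as (date, text) pairs with the date already parsed. The names extract_terms and terms_by_date are hypothetical, and the tokenizer is just a simple regex. The per-date term counts are one plausible shape for the "documents" that LDA would cluster, not a format that Mahout prescribes:

```python
import re
from collections import Counter, defaultdict


def extract_terms(text, min_n=2, max_n=4):
    """Return every 2-to-4 word n-gram in the text, lowercased."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    terms = []
    for n in range(min_n, max_n + 1):
        for i in range(len(words) - n + 1):
            terms.append(" ".join(words[i:i + n]))
    return terms


def terms_by_date(pages):
    """pages: iterable of (date, text) pairs. Returns {date: Counter of terms}."""
    bags = defaultdict(Counter)
    for page_date, text in pages:
        if page_date is None:
            continue  # pages without an associated date are skipped
        bags[page_date].update(extract_terms(text))
    return bags
```

Each date's Counter then acts as one bag of terms, so terms that co-occur on the same dates would be candidates to land in the same topic.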

I don't want to get into named entity extraction, or anything that involves more than simple data extraction and then the application of a scalable algorithm currently supported by Mahout.

Looking for feedback on the above, in terms of feasibility, level of effort, interest from others in helping, etc.

Or if somebody else has a suggestion for something simple that could provide interesting/obvious results, I'm all ears.


Thanks!

-- Ken

PS - Some additional thoughts on the problem...

 * I would skip any of my multi-word terms that start/end with stopwords.

 * I would skip any pages that don't have associated dates.

 * I wouldn't worry about trying to figure out which date "the previous Tuesday" referred to.

 * I could skip terms that have too many different associated dates, though handling a term whose dates cluster (are close to one another) would be important.
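
As a rough sketch of the term filters above, again in Python: the stopword list, the max_dates and max_span_days thresholds, and the names has_stopword_edge and keep_term are all illustrative assumptions, and "clustered" is approximated here as "every date falls within a short window", which is only one of several ways to define it:

```python
from datetime import date

# Illustrative stopword list only; a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def has_stopword_edge(term):
    """True if the term's first or last word is a stopword."""
    words = term.split()
    if not words:
        return False
    return words[0] in STOPWORDS or words[-1] in STOPWORDS


def keep_term(term, dates, max_dates=3, max_span_days=7):
    """Keep a term unless it has a stopword edge or too many scattered dates."""
    if has_stopword_edge(term):
        return False
    distinct = sorted(set(dates))
    if len(distinct) <= max_dates:
        return True
    # Many dates are acceptable if they all fall within a short window.
    return (distinct[-1] - distinct[0]).days <= max_span_days
```

With these defaults, a term seen on five consecutive days is kept, while one seen on the first of each month for ten months is dropped.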


--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
