I am no Mahout expert, but do knw MR, CF and ML. I feel if you do a sparse matrix implementation of K Means things will be much simpler and manageable.
rough calculation: ~1M unique users ~1000 unique sites a user visits in a month ~1G entries ~ 4 bytes per entry ~ 4G Data ~ x10(overheads) 40G ~ Distribute data to multiple machines grouped by user id. - KMeans is a perfect algo to be implemented in MR paradigm. I dont think creation and clustering is gonna take any thing more than couple of hours. I have done similar experiments at much larger scale, and it works. Let me know if you have any issues. Abhishek On Wed, Feb 18, 2009 at 12:13 AM, Marcus Herou <[email protected]>wrote: > Hi. > > Been visiting some mailinglists trying to find directions for howto find > address a quite cool problem. The answer have so far been. "Look at K-Means > clustering". Since I'm quite familiar with Hadoop which is incorporated in > both our crawler and into our webstats engine it seemed that the right gang > to turn to was the Mahout users. > > Basically we have 40k sites in our network of which we track weblogstats > like unique browsers, page impressions and sessions etc. > We want to learn more about our network so I've started to develop a > solution which would create a similarity matrix so you could say: > > These 10 sites are most similar to Site X in terms of visiting patterns > i.e. > same kind of audience. > > We have one big problem though... It will take 5 years to compute this > matrix at the current implementation speed :) That's why I'm starting to > look elsewhere. > > The matrix is really simple and below is an example > site1 site2 site3.... > > uid1 X X > uid2 X X > uid3 X > .... > > or table wise could be something like > CREATE TABLE UniqueSiteVisitorSample( > s1 bit, > s2 bit, > s3 bit, > .... > uid bigint > ) > > Where the X (bit set) means that one visitor visited a specific site, so > two > sites with many common "X's" is similar... > As I said _very_ simple datastructure but large and I don't know of any > storage mechanism where you could store 40k+ columns... > > Would it be feasible to use Mahout to create some output which stated how > similar SiteX is with SiteA,SiteB etc ? > > If it takes some hours (or days) to compute then it's quite OK from my > standpoint since the matrix could be recreated every X weeks or so. > > Hope anyone here at the list thinks, "Man this guy is stupid, he should do > it like this!":) > > /Marcus > > -- > Marcus Herou CTO and co-founder Tailsweep AB > +46702561312 > [email protected] > http://www.tailsweep.com/ > http://blogg.tailsweep.com/ > -- cheers, Abhishek Agarwal www.it.iitb.ac.in/~abhisheka
