I did, some 6+ months ago (pre all-IDs-are-longs changes). I remember seeing the most time spent in TanimotoCoefficientSimilarity and thinking "damn, this is all just set intersection and basic math operations - how do I speed that up?".
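
For reference, a minimal sketch of the "set intersection and basic math" that a Tanimoto coefficient boils down to for boolean preferences. This is not Taste's actual TanimotoCoefficientSimilarity code. The class and method names are hypothetical, and it assumes each user's item IDs are already held as a sorted, duplicate-free long[].

```java
final class TanimotoSketch {

    // Counts IDs present in both arrays with a single merge-style pass.
    // Assumption: both arrays are sorted ascending and contain no duplicates.
    static int intersectionSize(long[] a, long[] b) {
        int i = 0;
        int j = 0;
        int count = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;
            } else if (a[i] > b[j]) {
                j++;
            } else {
                count++;
                i++;
                j++;
            }
        }
        return count;
    }

    // Tanimoto coefficient: |A intersect B| / |A union B|.
    // Returns 0.0 when both sets are empty (a choice made for this sketch).
    static double tanimoto(long[] a, long[] b) {
        int inter = intersectionSize(a, b);
        int union = a.length + b.length - inter;
        return union == 0 ? 0.0 : (double) inter / union;
    }

    public static void main(String[] args) {
        long[] user1 = {1L, 2L, 3L, 5L};
        long[] user2 = {2L, 3L, 7L};
        // 2 shared items out of 5 distinct items
        System.out.println(tanimoto(user1, user2)); // prints 0.4
    }
}
```

The per-pair cost here is linear in the two users' item counts, so any remaining cost would mostly come from how many user pairs get compared, not from the arithmetic itself.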
Otis

----- Original Message ----
> From: Grant Ingersoll <[email protected]>
> To: [email protected]
> Sent: Tue, November 24, 2009 3:25:53 PM
> Subject: Re: Taste speed
>
> Have you done any profiling? It would be interesting to know where the
> bottlenecks are on your dataset.
>
> -Grant
>
> On Nov 24, 2009, at 2:37 PM, Otis Gospodnetic wrote:
>
> > Correction for the number of user and item data:
> > Users: 25K
> > Items: 2K
> >
> > I am less worried about increasing the number of potential items to
> > recommend.
> > I am more interested in getting more users into Taste, so the larger
> > percentage of my users can get recommendations.
> > For example, to filter out users I require certain level of activity
> > in terms of the number of items previously consumed.
> > With that threshold at 15, I get about 25K users (the above) -- so 25K
> > users consumed 15 or more items
> > With 10, I get about 50K users who consumed 10 or more items.
> > With 5, I get about 200K users who consumed 5 or more items
> > (presumably just 5 items would produce good-enough recommendations)
> >
> > I know I could lower the sampling rate and get more users in, but that
> > feels like cheating and will lower the quality of recommendations. I
> > have a feeling even with the sampling rate of 1.0 I should be able to
> > get more users into Taste and still have Taste give me recommendations
> > in 100-200ms with only 150-300 reqs/minute.
> >
> > Otis
> >
> > ----- Original Message ----
> >> From: Otis Gospodnetic
> >> To: [email protected]
> >> Sent: Tue, November 24, 2009 2:10:07 PM
> >> Subject: Taste speed
> >>
> >> Hello,
> >>
> >> I've been using Taste for a while, but it's not scaling well, and I
> >> suspect I'm doing something wrong.
> >> When I say "not scaling well", this is what I mean:
> >> * I have 1 week's worth of data (user,item datapoints)
> >> * I don't have item preferences, so I'm using the boolean model
> >> * I have caching in front of Taste, so the rate of requests that
> >>   Taste needs to handle is only 150-300 reqs/minute/server
> >> * The server is an 8-core 2.5GHz 32-bit machine with 32 GB of RAM
> >> * I use 2GB heap (-server -Xms2000M -Xmx2000M -XX:+AggressiveHeap
> >>   -XX:MaxPermSize=128M -XX:+CMSClassUnloadingEnabled
> >>   -XX:+CMSPermGenSweepingEnabled) and Java 1.5 (upgrade scheduled
> >>   for Spring)
> >>
> >> ** The bottom line is that with all of the above, I have to filter
> >> out less popular items and less active users in order to be able to
> >> return recommendations in a reasonable amount of time (e.g. 100-200
> >> ms at the 150-300 reqs/min rate). In the end, after this filtering,
> >> I end up with, say, 30K users and 50K items, and that's what I use
> >> to build the DataModel. If I remove filtering and let more data in,
> >> the performance goes down the drain.
> >>
> >> My feeling is 30K users and 50K items makes for an awfully small
> >> data set and that Taste, esp. at only 150-300 reqs/min on an 8-core
> >> server should be much faster. I have a feeling I'm doing something
> >> wrong and that Taste is really capable of handling more data,
> >> faster. Here is the code I use to construct the recommender:
> >>
> >>   idMigrator = LocalMemoryIDMigrator.getInstance();
> >>   model = MyDataModel.getInstance("itemType");
> >>
> >>   // ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
> >>   similarity = new TanimotoCoefficientSimilarity(model);
> >>   similarity = new CachingUserSimilarity(similarity, model);
> >>
> >>   // hood size is 50, minSimilarity is 0.1, samplingRate is 1.0
> >>   hood = new NearestNUserNeighborhood(hoodSize, minSimilarity,
> >>       similarity, model, samplingRate);
> >>
> >>   recommender = new GenericUserBasedRecommender(model, hood, similarity);
> >>   recommender = new CachingRecommender(recommender);
> >>
> >> What do you think of the above numbers?
> >>
> >> Thanks,
> >> Otis
> >
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> using Solr/Lucene:
> http://www.lucidimagination.com/search
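
The activity threshold described in the quoted thread (keep only users who consumed at least N items before building the DataModel) could be sketched roughly as follows. Everything here is hypothetical: the class name, the method name, and the input shape (an array of {userID, itemID} pairs) are all invented for illustration. It also assumes that "items consumed" means distinct items, which the thread does not state.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class ActivityFilterSketch {

    // Returns the IDs of users who consumed at least minItems distinct items.
    // Each element of userItemPairs is a two-element array: {userID, itemID}.
    static Set<Long> activeUsers(long[][] userItemPairs, int minItems) {
        Map<Long, Set<Long>> itemsByUser = new HashMap<>();
        for (long[] pair : userItemPairs) {
            itemsByUser.computeIfAbsent(pair[0], k -> new HashSet<>()).add(pair[1]);
        }
        Set<Long> kept = new HashSet<>();
        for (Map.Entry<Long, Set<Long>> entry : itemsByUser.entrySet()) {
            if (entry.getValue().size() >= minItems) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        long[][] pairs = {
            {1L, 10L}, {1L, 11L}, {1L, 12L},
            {2L, 10L},
            {3L, 10L}, {3L, 11L}
        };
        // With a threshold of 2, users 1 and 3 are kept and user 2 is dropped.
        System.out.println(activeUsers(pairs, 2).size()); // prints 2
    }
}
```

Running the same pass with thresholds of 15, 10 and 5 would give the user counts for each cut-off in one place, which makes the trade-off between coverage and model size easier to compare.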
