Agree, and I am almost sure it is in the string-to-long ID conversion. In particular, this could be nasty in your DataModel.
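To make the conversion cost concrete: the usual trick is to hash each string ID down to a long and keep a reverse map so you can translate back. This is a minimal, self-contained sketch of that idea only; it is not Mahout's actual IDMigrator implementation, and the class and method names here are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch: map string IDs to longs by hashing, and keep an
 * in-memory reverse map so longs can be translated back to the original
 * string IDs. The hash is one-way, which is exactly why a reverse map
 * (or a database-side mapping table) is needed before running queries
 * against string-keyed data.
 */
public class StringLongIdMap {

  private final Map<Long, String> reverse = new HashMap<>();

  /** Hash a string ID down to a long using the first 8 bytes of its MD5 digest. */
  public long toLongID(String stringID) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      byte[] hash = md5.digest(stringID.getBytes(StandardCharsets.UTF_8));
      long result = 0L;
      for (int i = 0; i < 8; i++) {
        result = (result << 8) | (hash[i] & 0xFF);  // fold 8 bytes into a long
      }
      reverse.put(result, stringID);  // remember the mapping so we can go back
      return result;
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);  // MD5 is always available on the JVM
    }
  }

  /** Translate a long back to the original string ID, or null if never seen. */
  public String toStringID(long longID) {
    return reverse.get(longID);
  }
}
```

The reverse lookup is the piece that tends to get overlooked, and it is what the "translate longs back to strings before hitting the database" point below is about.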
I did an implementation this summer where conversion was needed, and the data was in a database. For this to work, you really have to translate longs back to strings before hitting the database, for example, so that queries can use the natively-indexed string IDs. There are many ways to work it, and only a few that perform. Reply privately if the details are a little confidential; I think I can provide more insights. I don't think this is the best that can be done even with translation.

On Tue, Nov 24, 2009 at 8:25 PM, Grant Ingersoll <[email protected]> wrote:

> Have you done any profiling? It would be interesting to know where the
> bottlenecks are on your dataset.
>
> -Grant
>
> On Nov 24, 2009, at 2:37 PM, Otis Gospodnetic wrote:
>
>> Correction for the number of user and item data:
>> Users: 25K
>> Items: 2K
>>
>> I am less worried about increasing the number of potential items to
>> recommend.
>> I am more interested in getting more users into Taste, so the larger
>> percentage of my users can get recommendations.
>> For example, to filter out users I require a certain level of activity in
>> terms of the number of items previously consumed.
>> With that threshold at 15, I get about 25K users (the above) -- so 25K users
>> consumed 15 or more items.
>> With 10, I get about 50K users who consumed 10 or more items.
>> With 5, I get about 200K users who consumed 5 or more items (presumably just
>> 5 items would produce good-enough recommendations).
>>
>> I know I could lower the sampling rate and get more users in, but that feels
>> like cheating and will lower the quality of recommendations. I have a
>> feeling even with the sampling rate of 1.0 I should be able to get more
>> users into Taste and still have Taste give me recommendations in 100-200ms
>> with only 150-300 reqs/minute.
>>
>> Otis
>>
>> ----- Original Message ----
>>> From: Otis Gospodnetic <[email protected]>
>>> To: [email protected]
>>> Sent: Tue, November 24, 2009 2:10:07 PM
>>> Subject: Taste speed
>>>
>>> Hello,
>>>
>>> I've been using Taste for a while, but it's not scaling well, and I suspect I'm
>>> doing something wrong.
>>> When I say "not scaling well", this is what I mean:
>>> * I have 1 week's worth of data (user,item datapoints)
>>> * I don't have item preferences, so I'm using the boolean model
>>> * I have caching in front of Taste, so the rate of requests that Taste needs to
>>>   handle is only 150-300 reqs/minute/server
>>> * The server is an 8-core 2.5GHz 32-bit machine with 32 GB of RAM
>>> * I use a 2GB heap (-server -Xms2000M -Xmx2000M -XX:+AggressiveHeap
>>>   -XX:MaxPermSize=128M -XX:+CMSClassUnloadingEnabled
>>>   -XX:+CMSPermGenSweepingEnabled) and Java 1.5 (upgrade scheduled for Spring)
>>>
>>> ** The bottom line is that with all of the above, I have to filter out less
>>> popular items and less active users in order to be able to return
>>> recommendations in a reasonable amount of time (e.g. 100-200 ms at the 150-300
>>> reqs/min rate). In the end, after this filtering, I end up with, say, 30K users
>>> and 50K items, and that's what I use to build the DataModel. If I remove
>>> filtering and let more data in, the performance goes down the drain.
>>>
>>> My feeling is 30K users and 50K items makes for an awfully small data set and
>>> that Taste, esp. at only 150-300 reqs/min on an 8-core server, should be much
>>> faster. I have a feeling I'm doing something wrong and that Taste is really
>>> capable of handling more data, faster.
>>> Here is the code I use to construct the recommender:
>>>
>>> idMigrator = LocalMemoryIDMigrator.getInstance();
>>> model = MyDataModel.getInstance("itemType");
>>>
>>> // ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
>>> similarity = new TanimotoCoefficientSimilarity(model);
>>> similarity = new CachingUserSimilarity(similarity, model);
>>>
>>> // hood size is 50, minSimilarity is 0.1, samplingRate is 1.0
>>> hood = new NearestNUserNeighborhood(hoodSize, minSimilarity, similarity,
>>>     model, samplingRate);
>>>
>>> recommender = new GenericUserBasedRecommender(model, hood, similarity);
>>> recommender = new CachingRecommender(recommender);
>>>
>>> What do you think of the above numbers?
>>>
>>> Thanks,
>>> Otis
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene: http://www.lucidimagination.com/search
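As background on the quoted setup: on boolean preference data, the Tanimoto (Jaccard) coefficient that TanimotoCoefficientSimilarity is based on reduces to |A ∩ B| / |A ∪ B| over two users' item sets, and this per-pair set work is repeated for every candidate neighbor, which is why wrapping it in CachingUserSimilarity pays off. A standalone illustration of that computation (not Mahout code; names here are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Illustrative sketch of the Tanimoto (Jaccard) coefficient over boolean
 * preference data: |A ∩ B| / |A ∪ B| for two users' consumed-item sets.
 * This is the per-pair work a Tanimoto-style user similarity performs for
 * each candidate neighbor during neighborhood computation.
 */
public class TanimotoSketch {

  public static double tanimoto(Set<Long> a, Set<Long> b) {
    if (a.isEmpty() && b.isEmpty()) {
      return 0.0;  // no overlap information at all
    }
    Set<Long> intersection = new HashSet<>(a);
    intersection.retainAll(b);  // items both users consumed
    int unionSize = a.size() + b.size() - intersection.size();
    return (double) intersection.size() / unionSize;
  }
}
```

For example, users with item sets {1, 2, 3} and {2, 3, 4} share 2 items out of 4 distinct items, giving a similarity of 0.5.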
