Agreed, and I am almost sure the cost is in the string-to-long ID
conversion. In particular, this could be nasty in your DataModel.

I did an implementation this summer where conversion was needed, and
the data was in a database. For this to work, you really have to
translate longs back to strings before hitting the database, for
example, so that queries can use the natively-indexed string IDs.
There are many ways to set it up, but only a few that perform well.

Reply privately if the details are a little confidential; I think I
can provide more insights. Even with translation, I don't think this
is the best that can be done.

On Tue, Nov 24, 2009 at 8:25 PM, Grant Ingersoll <[email protected]> wrote:
> Have you done any profiling?  It would be interesting to know where the 
> bottlenecks are on your dataset.
>
> -Grant
>
> On Nov 24, 2009, at 2:37 PM, Otis Gospodnetic wrote:
>
>> Correction to the user and item counts:
>> Users: 25K
>> Items: 2K
>>
>> I am less worried about increasing the number of potential items to 
>> recommend.
>> I am more interested in getting more users into Taste, so that a larger 
>> percentage of my users can get recommendations.
>> For example, to filter out users I require a certain level of activity in 
>> terms of the number of items previously consumed.
>> With that threshold at 15, I get about 25K users (the above) -- so 25K users 
>> consumed 15 or more items.
>> With 10, I get about 50K users who consumed 10 or more items.
>> With 5, I get about 200K users who consumed 5 or more items (presumably just 
>> 5 items would produce good-enough recommendations).
>>
>> I know I could lower the sampling rate and get more users in, but that feels 
>> like cheating and will lower the quality of recommendations.  I have a 
>> feeling even with the sampling rate of 1.0 I should be able to get more 
>> users into Taste and still have Taste give me recommendations in 100-200ms 
>> with only 150-300 reqs/minute.
>>
>>
>> Otis
>>
>>
>>
>> ----- Original Message ----
>>> From: Otis Gospodnetic <[email protected]>
>>> To: [email protected]
>>> Sent: Tue, November 24, 2009 2:10:07 PM
>>> Subject: Taste speed
>>>
>>> Hello,
>>>
>>> I've been using Taste for a while, but it's not scaling well, and I suspect 
>>> I'm
>>> doing something wrong.
>>> When I say "not scaling well", this is what I mean:
>>> * I have 1 week's worth of data (user,item datapoints)
>>> * I don't have item preferences, so I'm using the boolean model
>>> * I have caching in front of Taste, so the rate of requests that Taste 
>>> needs to
>>> handle is only 150-300 reqs/minute/server
>>> * The server is an 8-core 2.5GHz 32-bit machine with 32 GB of RAM
>>> * I use 2GB heap (-server -Xms2000M -Xmx2000M -XX:+AggressiveHeap
>>> -XX:MaxPermSize=128M -XX:+CMSClassUnloadingEnabled
>>> -XX:+CMSPermGenSweepingEnabled) and Java 1.5 (upgrade scheduled for Spring)
>>>
>>> ** The bottom line is that with all of the above, I have to filter out less
>>> popular items and less active users in order to be able to return
>>> recommendations in a reasonable amount of time (e.g. 100-200 ms at the 
>>> 150-300
>>> reqs/min rate).  In the end, after this filtering, I end up with, say, 30K 
>>> users
>>> and 50K items, and that's what I use to build the DataModel.  If I remove
>>> filtering and let more data in, the performance goes down the drain.
>>>
>>> My feeling is 30K users and 50K items makes for an awfully small data set 
>>> and
>>> that Taste, especially at only
>>> 150-300 reqs/min on an 8-core server, should be much faster.  I have a 
>>> feeling
>>> I'm doing something wrong and that Taste is really capable of handling more
>>> data, faster.  Here is the code I use to construct the recommender:
>>>
>>>    idMigrator = LocalMemoryIDMigrator.getInstance();
>>>    model = MyDataModel.getInstance("itemType");
>>>
>>>    // UserSimilarity similarity = new LogLikelihoodSimilarity(model);
>>>    similarity = new TanimotoCoefficientSimilarity(model);
>>>    similarity = new CachingUserSimilarity(similarity, model);
>>>
>>>    // hoodSize is 50, minSimilarity is 0.1, samplingRate is 1.0
>>>    hood = new NearestNUserNeighborhood(hoodSize, minSimilarity, similarity,
>>>                                        model, samplingRate);
>>>
>>>    recommender = new GenericUserBasedRecommender(model, hood, similarity);
>>>    recommender = new CachingRecommender(recommender);
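
(The two caching wrappers in the quoted code amount to memoizing expensive
computations. A minimal sketch of the pairwise-similarity case in plain
Java, with hypothetical class and interface names rather than Mahout's
actual ones:)

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch (not Mahout's actual class) of what a similarity
// cache does: pairwise similarity is expensive to compute, so results
// are memoized under an order-independent key and reused.
public class MemoizingSimilarity {

  // Minimal stand-in for a user-similarity interface.
  public interface Similarity {
    double userSimilarity(long userID1, long userID2);
  }

  private final Similarity delegate;
  private final Map<String, Double> cache = new ConcurrentHashMap<>();
  private int misses = 0;

  public MemoizingSimilarity(Similarity delegate) {
    this.delegate = delegate;
  }

  public double userSimilarity(long userID1, long userID2) {
    // Order the pair so (a,b) and (b,a) share a single cache entry.
    long lo = Math.min(userID1, userID2);
    long hi = Math.max(userID1, userID2);
    String key = lo + ":" + hi;
    Double cached = cache.get(key);
    if (cached != null) {
      return cached;
    }
    misses++;
    double sim = delegate.userSimilarity(userID1, userID2);
    cache.put(key, sim);
    return sim;
  }

  // Exposed so the effect of caching is observable.
  public int missCount() {
    return misses;
  }
}
```

(A production cache would also bound its size; this only shows the
memoization idea.)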
>>>
>>> What do you think of the above numbers?
>>>
>>> Thanks,
>>> Otis
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using 
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
