The last writer is suggesting using the triangle inequality to cut down the
search space.  If c is the centroid of cluster C, then no point in C can be
closer to x than ||x-c|| - r(C), where r(C) is the (precomputed) radius of
the cluster, i.e. the distance from c to the farthest point in C.  Whether
you use the bound in an exact algorithm or as a search heuristic is up to
you.
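
Concretely, here is a minimal Scala sketch of the bound; the Cluster type,
Euclidean distance, and exact-scan fallback are my own assumptions, not
anything from the thread:

    // Triangle inequality: for any p in C,
    // ||x - p|| >= ||x - c|| - ||p - c|| >= ||x - c|| - r(C).
    case class Cluster(centroid: Array[Double], radius: Double,
                       members: Seq[Array[Double]])

    def dist(a: Array[Double], b: Array[Double]): Double =
      math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

    def lowerBound(x: Array[Double], c: Cluster): Double =
      math.max(0.0, dist(x, c.centroid) - c.radius)

    // Exact k-NN: visit clusters in order of the bound and skip any whose
    // bound already exceeds the current k-th best distance.
    def knn(x: Array[Double], clusters: Seq[Cluster], k: Int): Seq[Array[Double]] = {
      val best = collection.mutable.PriorityQueue.empty[(Double, Array[Double])](
        Ordering.by(_._1))                            // max-heap on distance
      for (c <- clusters.sortBy(lowerBound(x, _))) {
        val kth = if (best.size < k) Double.PositiveInfinity else best.head._1
        if (lowerBound(x, c) < kth)
          for (p <- c.members) {
            val d = dist(x, p)
            if (best.size < k) best.enqueue((d, p))
            else if (d < best.head._1) { best.dequeue(); best.enqueue((d, p)) }
          }
      }
      best.dequeueAll.reverse.map(_._2)               // nearest first
    }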

Another approach can be very successful if the user-attributes matrix is
very sparse.  The idea is to exploit the additive nature of common
similarity measures; I have used this for cosine and for a Bayesian
posterior.  Here is the basic idea: a "User" index holds all the attributes
of a given user (sorted by strength), and an "Attribute" index holds all
the users having a given attribute (again sorted by strength).  To find the
best k matches for user i, start with user i's strongest attribute and
search the attribute index for users with high scores on that attribute,
then move on to user i's next-strongest attribute.  If you proceed with a
"diagonal frontier" (as you fetch candidates based on i's weaker
attributes, also go back and fetch more, weaker candidates for i's strong
attributes), you get a candidate pool plus an upper bound on the score of
any user outside the pool.  You keep expanding the pool until that bound
drops below the scores of the k best candidates, at which point the top k
are exact.  In practice you'd want to cap the candidate pool size, but the
result is almost always exact anyway.
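
Here is a rough Scala sketch of that frontier search, assuming non-negative
strengths and a plain dot product as the additive score (cosine, if the
vectors are pre-normalised).  For brevity it advances the frontier
round-robin, Fagin-style; probing deeper on i's strong attributes gives the
"diagonal" schedule.  All names and index layouts are my assumptions:

    type UserId = Int
    type Attr   = Int

    // userIndex: each user's (attribute, strength) pairs, strongest first.
    // attrIndex: for each attribute, (user, strength) pairs, strongest first.
    def topK(i: UserId, k: Int,
             userIndex: Map[UserId, Array[(Attr, Double)]],
             attrIndex: Map[Attr, Array[(UserId, Double)]]): Seq[(UserId, Double)] = {
      val query = userIndex(i)
      val qVec  = query.toMap
      def score(u: UserId): Double =                  // exact additive score
        userIndex(u).iterator.map { case (a, s) => qVec.getOrElse(a, 0.0) * s }.sum

      val depth = Array.fill(query.length)(0)         // frontier depth per list
      val pool  = collection.mutable.Map.empty[UserId, Double]
      var done  = false
      while (!done) {
        for (j <- query.indices) {                    // advance each list one step
          val postings = attrIndex(query(j)._1)
          if (depth(j) < postings.length) {
            val u = postings(depth(j))._1
            if (u != i && !pool.contains(u)) pool(u) = score(u)
            depth(j) += 1
          }
        }
        // An unseen user can do no better than the frontier strength in
        // every list, so this bounds the score of anyone outside the pool.
        val bound = query.indices.iterator.map { j =>
          val (a, w)   = query(j)
          val postings = attrIndex(a)
          if (depth(j) < postings.length) w * postings(depth(j))._2 else 0.0
        }.sum
        val top = pool.values.toSeq.sortBy(-_).take(k)
        done = (top.length == k && top.last >= bound) ||
               query.indices.forall(j => depth(j) >= attrIndex(query(j)._1).length)
      }
      pool.toSeq.sortBy(-_._2).take(k)
    }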

For a million users, you should be able to distribute the data needed to
make a recommendation (either the centroids or the attribute indexes) to
every worker, and simply partition the work by the users you want to
generate recommendations for.
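
Here is a minimal Spark sketch of that partitioning in Scala, reusing the
hypothetical topK above and assuming the indexes fit in executor memory
(again, all names are mine):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Ship the read-only indexes to every executor once, then split the
    // work by the users we want recommendations for.
    def allTopK(sc: SparkContext,
                userIndex: Map[UserId, Array[(Attr, Double)]],
                attrIndex: Map[Attr, Array[(UserId, Double)]],
                targets: RDD[UserId], k: Int): RDD[(UserId, Seq[(UserId, Double)])] = {
      val ui = sc.broadcast(userIndex)
      val ai = sc.broadcast(attrIndex)
      targets.map(u => u -> topK(u, k, ui.value, ai.value))
    }

I hope this helps.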

Tom


On Sat, Apr 12, 2014 at 11:35 AM, Xiaoli Li <lixiaolima...@gmail.com> wrote:

> Hi Guillaume,
>
> This sounds like a good idea to me. I am a newbie here. Could you explain
> further how you will determine which clusters to keep? By the distance
> between each element and each cluster center?
> Will you keep several clusters for each element when searching for nearest
> neighbours? Thanks.
>
>   Xiaoli
>
>
>
>
> On Sat, Apr 12, 2014 at 9:46 AM, Guillaume Pitel <
> guillaume.pi...@exensa.com> wrote:
>
>>  Hi,
>>
>> I'm doing this here for multiple tens of millions of elements (the goal
>> is to reach multiple billions), on a relatively small cluster (7 nodes,
>> 4 cores, 32GB RAM). We use multi-probe KLSH. All you have to do is run a
>> k-means on your data, then compute the distance between each element and
>> each cluster center, keep the few closest clusters, and look for nearest
>> neighbours only inside those clusters.
>>
>> This method is known to perform very well and to vastly speed up the
>> computation.
>>
>> The hardest part is deciding how many clusters to compute and how many
>> to keep. As a rule of thumb, I generally aim for 300-10000 elements per
>> cluster, and keep 5-20 clusters per element.
>>
>> Guillaume
>>
>>
>>  I am implementing an algorithm using Spark. I have one million users. I
>> need to compute the similarity between each pair of users using some user
>> attributes. For each user, I need to get the top k most similar users.
>> What is the best way to implement this?
>>
>>  Thanks.
>>
>>
>>
>> --
>>  *Guillaume PITEL, Président*
>> +33(0)6 25 48 86 80 / +33(0)9 70 44 67 53
>>
>>  eXenSa S.A.S. <http://www.exensa.com/>
>>  41, rue Périer - 92120 Montrouge - FRANCE
>> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
>>
>
>
