Philipp,

how are your features stored: what language / library? Are the values binary, small ints, or floats?

To speed up the inner loop A . B, sort the indices of each vector once; then A . B takes time ~ min( Nnonzero A, Nnonzero B ). I don't know of a sparse-vector library that does this, though. If the values are all positive you can even beat that: the running sum only grows, so as soon as running sum >= nearest so far you can quit early and return Toofar. (Algorithms are more fun, but inner loops are important.)
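A rough sketch of that inner loop in plain Python, in case it helps. Assumptions that are mine and not from any library: each vector is a pair of lists (sorted nonzero indices, matching values), and the names sparse_dot, cosine, bound and TOOFAR are made up for illustration. The merge stops as soon as either vector runs out of nonzeros, so it takes at most Nnonzero A + Nnonzero B steps. The early exit follows the rule above literally and is only valid when all values are nonnegative.

```python
import math

TOOFAR = None  # hypothetical sentinel returned by the early exit


def sparse_dot(idx_a, val_a, idx_b, val_b, bound=None):
    """Dot product of two sparse vectors with sorted index lists.

    If bound is given (nonnegative values assumed), return TOOFAR as
    soon as the running sum reaches it.
    """
    i = j = 0
    total = 0.0
    # Walk both sorted index lists in step; stop when either is exhausted.
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:
            total += val_a[i] * val_b[j]
            if bound is not None and total >= bound:
                return TOOFAR
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:
            i += 1
        else:
            j += 1
    return total


def cosine(idx_a, val_a, idx_b, val_b):
    """Plain cosine similarity (no tf-idf weighting) of two nonzero vectors."""
    norm_a = math.sqrt(sum(v * v for v in val_a))
    norm_b = math.sqrt(sum(v * v for v in val_b))
    return sparse_dot(idx_a, val_a, idx_b, val_b) / (norm_a * norm_b)


# The two rows from the quoted message, [1 2 5] and [3 1 0], in sparse form.
a_idx, a_val = [0, 1, 2], [1, 2, 5]
b_idx, b_val = [0, 1], [3, 1]
dot = sparse_dot(a_idx, a_val, b_idx, b_val)
sim = cosine(a_idx, a_val, b_idx, b_val)
```

The norms can be computed once per vector and cached, so the merge loop is the only per-pair cost.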
For clustering (I don't know your application) you might look at Markov clustering, http://www.micans.org/mcl .

cheers
  -- denis

On 28/10/2012 00:07, Philipp Singer wrote:
> The problem with the lucene solution is that I do not need tfidf. I
> really have to do simple cosine similarity on my available vectors.
>
> So e.g., my matrix (vectors) look the following way:
>
> [[1 2 5]
>  [3 1 0]]

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general