Philipp,
   how are your features stored: what language / library?
Are the values binary / small ints / floats?
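To speed up the inner loop A . B, sort each vector's nonzero indices once;
then A . B is one merge pass over the two index lists,
taking time ~ Nnonzero A + Nnonzero B
(or nearer min( Nnonzero A, Nnonzero B ), times a log factor,
if you binary-search the shorter vector's indices in the longer one).
I don't know of a sparse-vector lib that does this though.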
If the values are all positive you can even beat that:
the running sum only grows, so as soon as
running sum >= nearest so far, quit early and return Toofar.
(Algorithms are more fun, but inner loops are important.)
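
A minimal sketch of the two ideas above, in Python only for illustration
(nothing here says what language the features are actually in).
It assumes each sparse vector is a pair of parallel lists, the nonzero
indices sorted ascending plus their values; the names sparse_dot,
sq_dist_cutoff and TOOFAR are made up for this sketch, not from any library.
The early exit is shown on squared Euclidean distance, which is an
assumption about what gets accumulated: its terms are never negative,
so the running sum can only grow.

```python
TOOFAR = None  # hypothetical sentinel: "this candidate cannot beat the best so far"


def sparse_dot(a_idx, a_val, b_idx, b_val):
    """Dot product of two sparse vectors with ascending index lists.

    A plain merge walk: each nonzero is visited at most once,
    so the cost is about len(a_idx) + len(b_idx) steps.
    """
    i = j = 0
    total = 0.0
    while i < len(a_idx) and j < len(b_idx):
        if a_idx[i] == b_idx[j]:
            total += a_val[i] * b_val[j]   # index present in both vectors
            i += 1
            j += 1
        elif a_idx[i] < b_idx[j]:
            i += 1                          # index only in A, contributes 0
        else:
            j += 1                          # index only in B, contributes 0
    return total


def sq_dist_cutoff(a_idx, a_val, b_idx, b_val, best):
    """Squared Euclidean distance, abandoned once it reaches `best`.

    Every term is a square, so the running sum never decreases;
    as soon as it is >= best the candidate is too far and we stop.
    """
    i = j = 0
    na, nb = len(a_idx), len(b_idx)
    total = 0.0
    while i < na or j < nb:
        if j >= nb or (i < na and a_idx[i] < b_idx[j]):
            d = a_val[i]                    # index only in A
            i += 1
        elif i >= na or b_idx[j] < a_idx[i]:
            d = b_val[j]                    # index only in B
            j += 1
        else:
            d = a_val[i] - b_val[j]         # index in both
            i += 1
            j += 1
        total += d * d
        if total >= best:
            return TOOFAR                   # quit early
    return total
```

On the two rows quoted below, [1 2 5] and [3 1 0], sparse_dot gives 5
and the full squared distance is 30, so with a best-so-far of 10 the
second function returns TOOFAR on the last term.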

For clustering (I don't know your application)
you might look at Markov clustering, http://www.micans.org/mcl

cheers
   -- denis


On 28/10/2012 00:07, Philipp Singer wrote:

> The problem with the lucene solution is that I do not need tfidf. I
> really have to do simple cosine similarity on my available vectors.
>
> So, e.g., my matrix (vectors) looks like this:
>
> [[1 2 5]
>  [3 1 0]]


_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
