Hi all,

I am currently trying to calculate the all-pairs similarity between a large number 
of text documents. I am using a TfidfVectorizer for feature generation and then 
want to compute the cosine similarity between all pairs of documents. Hence, I am 
calculating X * X.T on the L2-normalized matrix.
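
For reference, here is roughly what that step looks like (the corpus and the 
vectorizer parameters below are just placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import normalize

    corpus = ["first document ...", "second document ...", "third document ..."]

    vectorizer = TfidfVectorizer()        # norm='l2' is already the default here
    X = vectorizer.fit_transform(corpus)  # sparse CSR matrix, one row per document

    X = normalize(X, norm="l2")           # make sure the rows are L2-normalized
    S = X * X.T                           # S[i, j] = cosine similarity of documents i and j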

As my data is very large (X.shape = (350363, 2526183)), I thought about 
reducing the dimensionality first. I am using SparseRandomProjection for 
this task with the default parameters. I do not normalize the tf-idf features 
beforehand; I first perform the random projection and then L2-normalize the 
projected data before multiplying the matrix with its transpose. Unfortunately, 
the resulting similarity scores fall outside the expected 10% error, mostly 
somewhere around 20%.
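
In code, the pipeline looks roughly like this (X is the unnormalized tf-idf 
matrix from above; the random_state is an arbitrary choice):

    from sklearn.random_projection import SparseRandomProjection
    from sklearn.preprocessing import normalize

    # n_components='auto' picks the target dimensionality from the
    # Johnson-Lindenstrauss bound with the default eps=0.1 (the "10% error").
    srp = SparseRandomProjection(n_components="auto", eps=0.1, random_state=42)
    X_proj = srp.fit_transform(X)

    X_proj = normalize(X_proj, norm="l2")    # L2-normalize the projected rows
    S_approx = X_proj.dot(X_proj.T)          # approximate cosine similarities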

Does anyone know what I am doing wrong?

Apart from that, does anyone have a suggestion for how I can efficiently calculate 
the resulting matrix Y = X * X.T? I am currently thinking about using PyTables 
with some sort of chunked calculation algorithm. This is not the most efficient 
approach in terms of speed, but it would solve the memory bottleneck. I need the 
raw similarity scores between all documents in the end.
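
Something along these lines is what I have in mind: compute one block of rows at 
a time and stream it to disk, so the full n x n result never has to live in 
memory (the file name, chunk size and dtype below are just placeholder choices):

    import numpy as np
    import tables

    def chunked_similarity(X, out_path="similarities.h5", chunk_size=1000):
        n = X.shape[0]
        with tables.open_file(out_path, mode="w") as f:
            S = f.create_carray(f.root, "S", atom=tables.Float32Atom(),
                                shape=(n, n))
            for start in range(0, n, chunk_size):
                stop = min(start + chunk_size, n)
                block = X[start:stop].dot(X.T)   # (stop - start, n) slab of similarities
                if hasattr(block, "toarray"):    # densify if the product is sparse
                    block = block.toarray()
                S[start:stop, :] = block.astype(np.float32)
        return out_path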

Thanks!
Best,
Philipp