Hi all,

I am currently trying to compute all-pairs similarities between a large number of text documents. I use a TfidfVectorizer for feature generation and then want the cosine similarity between every pair of documents, so I compute X * X.T on the L2-normalized matrix (roughly as in the snippet below).
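For concreteness, this is more or less what I am doing; the docs variable is just a placeholder for my document collection:

from sklearn.feature_extraction.text import TfidfVectorizer

# docs is a placeholder for the list of raw text documents
vectorizer = TfidfVectorizer()      # norm='l2' by default, so rows come out unit length
X = vectorizer.fit_transform(docs)

# cosine similarity of every pair of documents;
# this product is what blows up in memory for ~350k documents
similarities = X * X.T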
As my data is very large (X.shape = (350363, 2526183)), I thought about reducing the dimensionality first. I am using SparseRandomProjection for this with the default parameters (eps = 0.1): I leave the tf-idf features unnormalized, apply the random projection, L2-normalize the projected data, and only then multiply the matrix with its transpose (see the first sketch below). Unfortunately, the resulting similarity scores fall outside the expected 10% error, mostly deviating by around 20%. Does anyone know what I am doing wrong?

Apart from that, does anyone know how I can efficiently compute the resulting matrix Y = X * X.T? I am currently thinking about using PyTables with some sort of chunked calculation (roughly the second sketch below). That is not the fastest way of doing it, but it would solve the memory bottleneck. In the end I need the raw similarity scores between all documents.

Thanks!

Best,
Philipp
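Concretely, the projection pipeline looks roughly like this (untested sketch with placeholder names, just mirroring the steps described above):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.random_projection import SparseRandomProjection

# tf-idf without the built-in L2 normalization
vectorizer = TfidfVectorizer(norm=None)
X = vectorizer.fit_transform(docs)

# default parameters: n_components='auto', eps=0.1
srp = SparseRandomProjection()
X_proj = srp.fit_transform(X)

# L2-normalize the projected rows, then take pairwise dot products
X_proj = normalize(X_proj)
approx_similarities = X_proj * X_proj.T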
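And this is the kind of chunked, out-of-core computation I have in mind for Y = X * X.T, assuming X is the L2-normalized (or projected and re-normalized) sparse matrix; the file name and chunk size are arbitrary. I realize a full float32 result of this size would be on the order of 500 GB on disk.

import numpy as np
import tables

n = X.shape[0]
chunk = 1000  # rows per block, to be tuned to the available memory

# the full n x n similarity matrix, stored on disk as float32
h5 = tables.open_file("similarities.h5", mode="w")
sims = h5.create_carray(h5.root, "sims", atom=tables.Float32Atom(), shape=(n, n))

for start in range(0, n, chunk):
    stop = min(start + chunk, n)
    # multiply a block of rows against the full matrix; only this block is dense in RAM
    block = X[start:stop] * X.T
    sims[start:stop, :] = block.toarray().astype(np.float32)

h5.close()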
