Hi Ian.
I'm not entirely certain, but I think the reason is that the output is assumed to be dense in most cases, and using sparse matrices would waste space.
So you have 1.3mil data points? How sparse is the resulting distance matrix?
I guess if your input is sparse enough, most vectors are orthogonal. Is that what is happening?
Maybe the dense output makes more sense for distances than similarities.
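
A small sketch of that last point, assuming cosine distance is computed as 1 - similarity and using a toy case of five mutually orthogonal unit vectors (the sizes are illustrative, not Ian's data). The similarity matrix has nonzeros only on the diagonal, while the distance matrix is nonzero everywhere except the diagonal:

```python
import numpy as np
import scipy.sparse as sp

# Toy case (assumption): 5 mutually orthogonal unit vectors, so the
# cosine similarity matrix is the identity and is very sparse.
n = 5
S = sp.identity(n, format="csr")

# Cosine distance taken as 1 - similarity: every orthogonal pair gets
# distance 1, so only the diagonal is zero.
D = 1.0 - S.toarray()

print("similarity nonzeros:", S.nnz)
print("distance nonzeros:  ", np.count_nonzero(D))
```

So a sparse container can save memory for similarities between mostly orthogonal vectors, but not for the corresponding distances.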

Cheers,
Andy


On 11/27/2014 11:26 AM, Ian Ozsvald wrote:
Hey all. I'm working on distance measurement in a reasonably high-dimensional but very sparse space (1.3mil * 35k matrix). At this size my 16GB laptop runs out of memory. I've walked back through the code and noticed something I don't understand.
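
A back-of-envelope estimate of why a dense result cannot fit, assuming the pairwise output is stored as float64 (8 bytes per value, an assumption) and that all 1.3mil rows are compared against each other:

```python
import math

n_rows = 1_300_000       # data points, from the message above
bytes_per_value = 8      # assumption: float64 output
ram_bytes = 16 * 10**9   # 16GB laptop, taken as decimal gigabytes

# A dense n x n result needs n * n values.
dense_bytes = n_rows * n_rows * bytes_per_value

# Largest square dense result that would fit in RAM with nothing else loaded.
max_rows = math.isqrt(ram_bytes // bytes_per_value)

print(f"dense result: {dense_bytes / 1e12:.2f} TB")
print(f"rows that fit in 16GB: about {max_rows}")
```

Under these assumptions the dense output is roughly three orders of magnitude larger than the available RAM, so only a sparse result is workable at this size.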

sklearn.metrics.pairwise_distances('cosine') calls pairwise.cosine_similarity, which takes sparse inputs and preserves their sparsity until the final call:

def cosine_similarity(X, Y)  # both inputs are csr sparse from a DictVectorizer(..., sparse=True)
    X_normalized = normalize(...)  # sparse result
    Y_normalized = X_normalized  # as both inputs are the same, still sparse
    K = linear_kernel(X_normalized, Y_normalized)

-> linear_kernel(X_normalized, Y_normalized)
   calls safe_sparse_dot(X, Y.T, dense_output=True)
and then the result is forced to be dense.

If safe_sparse_dot is called with dense_output=False then I get a sparse result and everything looks sensible with low RAM usage.
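
A minimal sketch of that comparison on a toy matrix, calling normalize and safe_sparse_dot directly and skipping linear_kernel. The 4 x 6 input is a made-up stand-in for the DictVectorizer output, and the import paths are those of recent scikit-learn releases, which may differ in 0.15:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import normalize
from sklearn.utils.extmath import safe_sparse_dot

# Toy stand-in (assumption) for the sparse DictVectorizer output.
X = sp.csr_matrix(np.array([
    [1.0, 0.0, 0.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 0.0, 4.0, 0.0],
    [0.0, 0.0, 0.0, 5.0, 0.0, 0.0],
]))

X_normalized = normalize(X)  # L2-normalizes each row; result stays sparse

# Same product, with and without forcing a dense result.
K_sparse = safe_sparse_dot(X_normalized, X_normalized.T, dense_output=False)
K_dense = np.asarray(
    safe_sparse_dot(X_normalized, X_normalized.T, dense_output=True)
)

print("sparse result:", sp.issparse(K_sparse), "stored values:", K_sparse.nnz)
print("dense result: ", K_dense.shape, "stored values:", K_dense.size)
```

Both calls return the same similarities; only the storage differs. The sparse version stores just the pairs that share at least one feature.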

I'm using 0.15; the current GitHub master shows the line:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/pairwise.py#L692

Was there a design decision to force dense matrices at this point? Maybe some call paths assume a dense result?

Ian.

--
Ian Ozsvald (A.I. researcher)
i...@ianozsvald.com

http://IanOzsvald.com
http://ModelInsight.io
http://MorConsulting.com
http://Annotate.IO
http://SocialTiesApp.com
http://TheScreencastingHandbook.com
http://FivePoundApp.com
http://twitter.com/IanOzsvald
http://ShowMeDo.com



_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
