Hey all. I'm working on distance measurement in a reasonably
high-dimensional but very sparse space (a 1.3 million x 35k matrix). At this
size my 16GB laptop runs out of memory, so I've walked back through the code
and noticed something I don't understand.

sklearn.metrics.pairwise_distances(X, metric='cosine') calls
pairwise.cosine_similarity, which takes sparse inputs and preserves their
sparsity until the final call:

    def cosine_similarity(X, Y):
        # both inputs are CSR sparse, from a DictVectorizer(..., sparse=True)
        X_normalized = normalize(...)  # sparse result
        Y_normalized = X_normalized    # both inputs are the same, so still sparse
        K = linear_kernel(X_normalized, Y_normalized)

    -> linear_kernel(X_normalized, Y_normalized)
       calls safe_sparse_dot(X, Y.T, dense_output=True)

and at that point the result is forced to be dense.
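
To put a rough number on the dense case, here is a back-of-envelope sketch. It assumes the pairwise result covers all 1.3 million rows against each other and is stored as float64 (8 bytes per entry). The helper name dense_matrix_bytes is made up for illustration.

```python
def dense_matrix_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed to hold a dense n_rows x n_cols array.

    itemsize=8 assumes float64 entries (an assumption, not taken from
    the scikit-learn source).
    """
    return n_rows * n_cols * itemsize


n_samples = 1_300_000  # rows in the input matrix
dense_bytes = dense_matrix_bytes(n_samples, n_samples)

# Report in decimal terabytes for comparison against 16 GB of RAM.
print("dense pairwise result: %.2f TB" % (dense_bytes / 1e12))
```

Under those assumptions the dense all-pairs result is several orders of magnitude larger than 16GB, whatever the sparsity of the input.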

If safe_sparse_dot is called with dense_output=False then I get a sparse
result and everything looks sensible with low RAM usage.
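
The sparse path described above can be sketched roughly as follows. This is a minimal illustration on a toy matrix, not the library's implementation. The helper name sparse_cosine_similarity is hypothetical, and it only handles the case where both inputs are the same matrix.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import normalize
from sklearn.utils.extmath import safe_sparse_dot


def sparse_cosine_similarity(X):
    """Cosine similarity between the rows of X, returned as a sparse matrix.

    Hypothetical helper: follows the same steps as the trace above, but
    asks safe_sparse_dot not to densify the result.
    """
    # L2-normalise each row; a sparse input gives a sparse output.
    X_normalized = normalize(X, norm="l2", axis=1, copy=True)
    # Row-by-row dot products of unit vectors are cosine similarities.
    return safe_sparse_dot(X_normalized, X_normalized.T, dense_output=False)


# Toy data: rows 0 and 2 share a feature, row 1 shares none with them.
X = sp.csr_matrix(np.array([[1.0, 0.0, 0.0],
                            [0.0, 2.0, 0.0],
                            [3.0, 0.0, 0.0]]))
K = sparse_cosine_similarity(X)
print(sp.issparse(K))
print(K.toarray())
```

The result only stays small when few pairs of rows share a non-zero feature. Converting it to distances (1 - similarity) would turn every implicit zero into a one, so it is only the similarity form that remains sparse.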

I'm using 0.15; the line in question on the current GitHub master is:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/pairwise.py#L692

Was there a design decision to force dense matrices at this point? Maybe
some call paths assume a dense result?

Ian.

-- 
Ian Ozsvald (A.I. researcher)
i...@ianozsvald.com

http://IanOzsvald.com
http://ModelInsight.io
http://MorConsulting.com
http://Annotate.IO
http://SocialTiesApp.com
http://TheScreencastingHandbook.com
http://FivePoundApp.com
http://twitter.com/IanOzsvald
http://ShowMeDo.com
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general