Hi Ian.
I'm not entirely certain, but I think the reason is that the output is
assumed to be dense in most cases, and storing a dense result in a sparse
format would waste space.
So you have 1.3 million data points? How sparse is the resulting distance
matrix? I guess if your input is sparse enough, most pairs of vectors are
orthogonal. Is that what is happening?
Maybe the dense output makes more sense for distances than for
similarities: orthogonal vectors have cosine similarity 0 but cosine
distance 1, so the distance matrix is dense even where the similarity
matrix is sparse.
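To make the similarity-versus-distance point concrete, here is a minimal sketch on a made-up 4 x 6 toy matrix (the data and variable names are illustrative only, not from Ian's problem). It uses plain scipy to L2-normalise the rows and then counts the non-zero entries of the similarity matrix and of 1 - similarity.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical toy data: very sparse rows, mostly with disjoint support.
X = sp.csr_matrix(np.array([
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 3.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]))

# L2-normalise each row so that the dot product is the cosine similarity.
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
X_normalized = (sp.diags(1.0 / row_norms) @ X).tocsr()

# Sparse x sparse product: only pairs sharing a feature get an entry.
similarity = (X_normalized @ X_normalized.T).tocsr()

# Cosine distance is 1 - similarity, so every orthogonal pair becomes 1.
distance = 1.0 - similarity.toarray()
n_nonzero_distances = int(np.count_nonzero(np.abs(distance) > 1e-12))

print("stored similarity entries:", similarity.nnz, "of", X.shape[0] ** 2)
print("non-zero distance entries:", n_nonzero_distances, "of", X.shape[0] ** 2)
```

Only the pairs that share at least one feature are stored in the similarity matrix, while nearly every off-diagonal entry of the distance matrix is non-zero.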
Cheers,
Andy
On 11/27/2014 11:26 AM, Ian Ozsvald wrote:
Hey all. I'm working on distance measurement in a reasonably
high-dimensional but very sparse space (a 1.3 million x 35k matrix). At
this size my 16GB laptop runs out of memory. I've walked back through the
code and noticed something I don't understand.
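A back-of-the-envelope estimate shows why a dense pairwise result cannot fit at this size. The sketch below assumes float64 values and, for the sparse case, a purely hypothetical average of 13 stored similarities per row with 8-byte indices; only the 1.3 million row count comes from the problem above.

```python
n_samples = 1_300_000        # rows in the input matrix
bytes_per_value = 8          # float64

# Dense n x n result: every entry is stored.
dense_bytes = n_samples * n_samples * bytes_per_value

# CSR result: values + column indices per stored entry, plus a row pointer.
assumed_nnz_per_row = 13     # hypothetical, depends entirely on the data
index_bytes = 8              # assuming 64-bit indices
nnz = n_samples * assumed_nnz_per_row
csr_bytes = nnz * (bytes_per_value + index_bytes) + (n_samples + 1) * index_bytes

print(f"dense result: {dense_bytes / 1e12:.1f} TB")
print(f"CSR result:   {csr_bytes / 1e9:.2f} GB (under the assumed sparsity)")
```

The sparse figure is only as good as the assumed number of non-zeros per row; if most rows share features, the sparse result grows towards the dense size.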
sklearn.metrics.pairwise_distances('cosine') calls
pairwise.cosine_similarity which takes sparse inputs and preserves
their sparsity until the final call:
    def cosine_similarity(X, Y):
        # both inputs are csr sparse, from a DictVectorizer(..., sparse=True)
        X_normalized = normalize(...)  # sparse result
        Y_normalized = X_normalized    # both inputs are the same, still sparse
        K = linear_kernel(X_normalized, Y_normalized)

    -> linear_kernel(X_normalized, Y_normalized)
       calls safe_sparse_dot(X, Y.T, dense_output=True)
       and then the result is forced to be dense.
If safe_sparse_dot is called with dense_output=False then I get a
sparse result and everything looks sensible with low RAM usage.
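The workaround described above can be sketched as follows. This is a minimal illustration, not the library's own code path: `sparse_cosine_similarity` is a hypothetical helper name, and the random test matrix stands in for the real data. It relies only on `sklearn.preprocessing.normalize` and `sklearn.utils.extmath.safe_sparse_dot`.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import normalize
from sklearn.utils.extmath import safe_sparse_dot


def sparse_cosine_similarity(X):
    """Cosine similarity between the rows of sparse X, returned sparse.

    Hypothetical helper: same steps as the trace above, but with
    dense_output=False so the product is never densified.
    """
    X_normalized = normalize(X, norm="l2", axis=1, copy=True)
    return safe_sparse_dot(X_normalized, X_normalized.T, dense_output=False)


# Small random stand-in for the real data (made up for illustration).
rng = np.random.RandomState(0)
X = sp.random(1000, 500, density=0.002, format="csr", random_state=rng)

K = sparse_cosine_similarity(X)
print(type(K).__name__, K.shape, "stored entries:", K.nnz)
```

The result stays small only when most pairs of rows share no features; converting it to distances with 1 - K would make it dense again.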
I'm using 0.15; the current GitHub master shows the line:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/pairwise.py#L692
Was there a design decision to force dense matrices at this point?
Maybe some call paths assume a dense result?
Ian.
--
Ian Ozsvald (A.I. researcher)
i...@ianozsvald.com
http://IanOzsvald.com
http://ModelInsight.io
http://MorConsulting.com
http://Annotate.IO
http://SocialTiesApp.com
http://TheScreencastingHandbook.com
http://FivePoundApp.com
http://twitter.com/IanOzsvald
http://ShowMeDo.com
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general