If your data is very sparse in the original space, you might also look at
taking a random projection (I think projecting onto a sparse SVD basis
would work too?) as preprocessing to "densify" the data before computing the
cosine similarity. You might get a win on feature size with this, depending
on the input size, at the cost of being an approximation rather than the real
deal.
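
A rough sketch of what I mean, using only numpy and scipy. The sizes, the
Gaussian projection matrix and the helper name normalize_rows are all
assumptions made up for illustration, not anything from scikit-learn
(sklearn.random_projection has ready-made transformers that would be the
more natural choice in practice):

```python
import numpy as np
import scipy.sparse as sp


def normalize_rows(M):
    """L2-normalize the rows of a dense array (hypothetical helper)."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows untouched
    return M / norms


# Illustrative sizes only, far smaller than the matrix in this thread.
n_samples, n_features, n_components = 200, 5000, 1000

# Very sparse input: about 1% of the entries are nonzero.
X = sp.random(n_samples, n_features, density=0.01, format="csr", random_state=0)

# Dense Gaussian projection, scaled so inner products are preserved
# in expectation.
rng = np.random.default_rng(0)
R = rng.standard_normal((n_features, n_components)) / np.sqrt(n_components)
X_proj = X @ R  # dense array of shape (n_samples, n_components)

# Compare exact cosine similarities with the ones computed after projection.
X_unit = normalize_rows(X.toarray())
X_proj_unit = normalize_rows(X_proj)
exact = X_unit @ X_unit.T
approx = X_proj_unit @ X_proj_unit.T
max_err = np.abs(exact - approx).max()
print("largest absolute error in cosine similarity:", max_err)
```

Note that the output similarity matrix is still n_samples x n_samples and
dense here, so this only helps with the feature dimension, not with the
size of the result.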
It kind of looks like the code assumes sparse inputs will still result in a
dense Gram matrix, which means the rows of the matrices are assumed not to be
orthogonal to each other (is *almost* orthogonal good enough to get a mostly
sparse result in this case, or must it be exact?), *if* I am thinking correctly.
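
To make the question concrete, here is a toy sketch with plain scipy. The
function name sparse_cosine_similarity and the 4 x 6 matrix are made up for
illustration; this is not the scikit-learn code path, just the same
normalize-then-dot idea with the product kept sparse. An entry K[i, j] only
stays out of the sparse result when rows i and j share no nonzero feature,
so any overlap at all, however small, still produces a stored value:

```python
import numpy as np
import scipy.sparse as sp


def sparse_cosine_similarity(X):
    """Cosine similarity between rows of X, returned as a sparse matrix."""
    X = sp.csr_matrix(X, dtype=np.float64)
    # Row L2 norms, computed without densifying X.
    norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
    norms[norms == 0] = 1.0  # avoid dividing all-zero rows by zero
    X_normalized = sp.diags(1.0 / norms) @ X
    # Sparse times sparse stays sparse: no dense n x n array is allocated.
    return (X_normalized @ X_normalized.T).tocsr()


# Rows 0 and 1 share features, rows 2 and 3 share one feature, and the two
# groups have disjoint supports, i.e. they are exactly orthogonal.
X = sp.csr_matrix(np.array([
    [1.0, 2.0, 0.0, 0.0, 0.0, 0.0],
    [3.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 4.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 2.0, 5.0],
]))

K = sparse_cosine_similarity(X)
density = K.nnz / (K.shape[0] * K.shape[1])
print("stored entries:", K.nnz, "density:", density)
```

So how sparse the Gram matrix ends up depends on how many pairs of rows have
completely disjoint supports, which is a property of the data rather than of
the code.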
Kyle
On Thu, Nov 27, 2014 at 11:42 AM, Michael Eickenberg <
michael.eickenb...@gmail.com> wrote:
> Dear Ian,
>
> I guess this comes from the assumption that these types of pairwise
> similarity matrices have been dense in many use cases. E.g. a Gaussian
> kernel matrix is completely dense. Many kernel methods also expect dense
> input. But it is true that this latter fact shouldn't necessarily be
> imposed on all the similarity measures if there is a possibility of sparse
> output...
>
> Michael
>
> On Thu, Nov 27, 2014 at 5:26 PM, Ian Ozsvald <i...@ianozsvald.com> wrote:
>
>> Hey all. I'm working on distance measurement in a reasonably
>> high-dimensional but very sparse space (1.3mil * 35k matrix). At this size
>> my 16GB laptop runs out of memory. I've walked back through the code and
>> noticed something I don't understand.
>>
>> sklearn.metrics.pairwise_distances('cosine') calls
>> pairwise.cosine_similarity which takes sparse inputs and preserves their
>> sparsity until the final call:
>>
>>     def cosine_similarity(X, Y):
>>         # both inputs are csr sparse from a DictVectorizer(..., sparse=True)
>>         X_normalized = normalize(...)  # sparse result
>>         Y_normalized = X_normalized  # as both inputs are the same, still sparse
>>         K = linear_kernel(X_normalized, Y_normalized)
>>
>> linear_kernel(X_normalized, Y_normalized) then calls
>> safe_sparse_dot(X, Y.T, dense_output=True), and the result is forced to be
>> dense.
>>
>> If safe_sparse_dot is called with dense_output=False then I get a sparse
>> result and everything looks sensible with low RAM usage.
>>
>> I'm using 0.15, the current github shows the line:
>>
>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/pairwise.py#L692
>>
>> Was there a design decision to force dense matrices at this point? Maybe
>> some call paths assume a dense result?
>>
>> Ian.
>>
>> --
>> Ian Ozsvald (A.I. researcher)
>> i...@ianozsvald.com
>>
>> http://IanOzsvald.com
>> http://ModelInsight.io
>> http://MorConsulting.com
>> http://Annotate.IO
>> http://SocialTiesApp.com
>> http://TheScreencastingHandbook.com
>> http://FivePoundApp.com
>> http://twitter.com/IanOzsvald
>> http://ShowMeDo.com
>>
>>
>> ------------------------------------------------------------------------------
>> Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
>> from Actuate! Instantly Supercharge Your Business Reports and Dashboards
>> with Interactivity, Sharing, Native Excel Exports, App Integration & more
>> Get technology previously reserved for billion-dollar corporations, FREE
>>
>> http://pubads.g.doubleclick.net/gampad/clk?id=157005751&iu=/4140/ostg.clktrk
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>