On a side note, I am semi-surprised that allowing the output of the dot to
be sparse "just worked" without crashing the rest of it...

On Thu, Nov 27, 2014 at 12:19 PM, Kyle Kastner <kastnerk...@gmail.com>
wrote:

> If your data is really, really sparse in the original space, you might
> also look at taking a random projection (I think projecting onto a sparse SVD
> basis would work too?) as preprocessing to "densify" the data before
> calling the cosine similarity. You might get a win on feature size with
> this, depending on the input size, at the cost of getting an approximation
> rather than the real deal.
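[Editor's note: a minimal sketch of the densification idea above, using scikit-learn's SparseRandomProjection; the 200 x 2000 toy matrix and n_components=100 are arbitrary stand-ins for illustration, not values from the thread.]

```python
import numpy as np
import scipy.sparse as sp
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.random_projection import SparseRandomProjection

# Toy stand-in for a very sparse input matrix (the real one in this
# thread is 1.3M x 35k).
X = sp.random(200, 2000, density=0.01, format="csr", random_state=0)

# Project down to a modest number of components. Pairwise distances are
# only approximately preserved (Johnson-Lindenstrauss), so the cosine
# similarities below are an approximation, not the real deal.
proj = SparseRandomProjection(n_components=100, random_state=0)
X_small = proj.fit_transform(X).toarray()  # densified, far fewer features

S = cosine_similarity(X_small)  # Gram matrix over the projected features
```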
>
> It kind of looks like the code assumes sparse inputs will still produce a
> dense Gram matrix, which means the rows are assumed not to be mutually
> orthogonal (is *almost* orthogonal good enough to get a mostly sparse
> result in this case, or must it be exact?), *if* I am thinking correctly.
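[Editor's note: to the "almost orthogonal" question, the sparsity of the Gram matrix hinges on overlapping supports rather than near-orthogonality: rows with disjoint supports produce entries that are never stored at all, while nearly orthogonal overlapping rows still produce stored (tiny) values. A hypothetical toy check:]

```python
import numpy as np
import scipy.sparse as sp

# Rows 0 and 1 have disjoint supports: exactly orthogonal, and their
# Gram entry is a *structural* zero (not stored). Row 2 is nearly
# orthogonal to row 1 but overlaps row 0, so those entries are stored.
X = sp.csr_matrix(np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.9, 0.0, 0.1, 0.0],
]))
G = X.dot(X.T).tocsr()  # sparse Gram matrix

# Stored entries: (0,0), (0,2), (1,1), (2,0), (2,2) -> only 5 of 9.
```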
>
> Kyle
>
> On Thu, Nov 27, 2014 at 11:42 AM, Michael Eickenberg <
> michael.eickenb...@gmail.com> wrote:
>
>> Dear Ian,
>>
>> I guess this comes from the assumption that these types of pairwise
>> similarity matrices have been dense in many use cases; e.g. a Gaussian
>> kernel matrix is completely dense. Many kernel methods also expect dense
>> input. But it is true that this latter fact shouldn't necessarily be
>> imposed on all the similarity measures if there is a possibility of sparse
>> output...
>>
>> Michael
>>
>> On Thu, Nov 27, 2014 at 5:26 PM, Ian Ozsvald <i...@ianozsvald.com> wrote:
>>
>>> Hey all. I'm working on distance measurement in a reasonably
>>> high-dimensional but very sparse space (a 1.3M x 35k matrix). At this size
>>> my 16GB laptop runs out of memory, so I've walked back through the code and
>>> noticed something I don't understand.
>>>
>>> sklearn.metrics.pairwise_distances(X, metric='cosine') calls
>>> pairwise.cosine_similarity, which takes sparse inputs and preserves their
>>> sparsity until the final call:
>>>
>>> def cosine_similarity(X, Y):
>>>     # both inputs are CSR sparse from a DictVectorizer(..., sparse=True)
>>>     X_normalized = normalize(X)      # sparse result
>>>     Y_normalized = X_normalized      # both inputs the same, still sparse
>>>     K = linear_kernel(X_normalized, Y_normalized)
>>>
>>> linear_kernel in turn calls safe_sparse_dot(X, Y.T, dense_output=True),
>>> and the result is forced to be dense.
>>>
>>> If safe_sparse_dot is called with dense_output=False, I get a sparse
>>> result and everything looks sensible, with low RAM usage.
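[Editor's note: the observation above can be reproduced in miniature with scikit-learn's normalize and safe_sparse_dot; the 100 x 500 random matrix is just an illustrative stand-in for the real data.]

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import normalize
from sklearn.utils.extmath import safe_sparse_dot

# Sparse input, as produced by e.g. DictVectorizer(sparse=True)
X = sp.random(100, 500, density=0.01, format="csr", random_state=0)

X_normalized = normalize(X)  # L2-normalize rows; result is still sparse CSR

# The dense path materializes the full n x n Gram matrix in RAM;
# the sparse path only stores the nonzero similarities.
K_dense = safe_sparse_dot(X_normalized, X_normalized.T, dense_output=True)
K_sparse = safe_sparse_dot(X_normalized, X_normalized.T, dense_output=False)
```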
>>>
>>> I'm using 0.15; the line in question on current GitHub master is here:
>>>
>>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/pairwise.py#L692
>>>
>>> Was there a design decision to force dense matrices at this point? Maybe
>>> some call paths assume a dense result?
>>>
>>> Ian.
>>>
>>> --
>>> Ian Ozsvald (A.I. researcher)
>>> i...@ianozsvald.com
>>>
>>> http://IanOzsvald.com
>>> http://ModelInsight.io
>>> http://MorConsulting.com
>>> http://Annotate.IO
>>> http://SocialTiesApp.com
>>> http://TheScreencastingHandbook.com
>>> http://FivePoundApp.com
>>> http://twitter.com/IanOzsvald
>>> http://ShowMeDo.com
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
>>> from Actuate! Instantly Supercharge Your Business Reports and Dashboards
>>> with Interactivity, Sharing, Native Excel Exports, App Integration & more
>>> Get technology previously reserved for billion-dollar corporations, FREE
>>>
>>> http://pubads.g.doubleclick.net/gampad/clk?id=157005751&iu=/4140/ostg.clktrk
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> Scikit-learn-general@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>>
>>
>>
>>
>