this is more a problem of the sample-axis size though: the similarity
matrix will still be 1.3M x 1.3M, which should blow past most current memory
sizes for any dtype. Sparsity in the output can mitigate this.
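A quick back-of-envelope check (assuming float64 entries and dense storage) shows why:

```python
# A dense n x n similarity matrix needs n^2 * 8 bytes for float64.
n = 1_300_000
dense_bytes = n * n * 8
print(dense_bytes / 1e12, "TB")  # ~13.5 TB -- far beyond any workstation's RAM
```

Even float32 only halves that, so a dense result is hopeless at this scale regardless of dtype.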

On Thursday, November 27, 2014, Kyle Kastner <kastnerk...@gmail.com> wrote:

> On a side note, I am semi-surprised that allowing the output of the dot to
> be sparse "just worked" without crashing the rest of it...
>
> On Thu, Nov 27, 2014 at 12:19 PM, Kyle Kastner <kastnerk...@gmail.com>
> wrote:
>
>> If your data is really, really sparse in the original space, you might
>> also look at taking a random projection (I think projecting onto a sparse
>> SVD basis would work too?) as preprocessing to "densify" the data before
>> calling the cosine similarity. You might get a win on feature size with
>> this, depending on the input size, at the cost of getting an approximation
>> rather than the real deal.
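As a sketch of the idea, using a plain Gaussian random projection in NumPy/SciPy (toy sizes made up here; in practice scikit-learn's SparseRandomProjection or TruncatedSVD estimators would be the proper tools):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.RandomState(0)

# Toy stand-in for the sparse input (the real data is ~1.3M x 35k csr).
X = sp.random(1000, 500, density=0.01, format="csr", random_state=rng)

# Project onto k random directions: the result is dense but much narrower,
# so pairwise cosine on it approximates cosine on the original features.
k = 50
R = rng.randn(X.shape[1], k) / np.sqrt(k)  # Gaussian projection matrix
X_dense = X @ R                            # (1000, 50) dense ndarray

# L2-normalize rows; cosine similarity is then just a dot product.
norms = np.linalg.norm(X_dense, axis=1, keepdims=True)
norms[norms == 0] = 1.0                    # guard all-zero rows
X_unit = X_dense / norms
S = X_unit @ X_unit.T                      # approximate cosine similarities
```

Note this only shrinks the feature axis; the n x n output is still the real bottleneck for 1.3M samples.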
>>
>> It kind of looks like the code assumes sparse inputs will still result in
>> a dense Gram matrix, which means the matrices are assumed not to be
>> orthogonal (is *almost* orthogonal good enough to get a mostly sparse
>> result in this case, or must it be exact?), *if* I am thinking correctly.
>>
>> Kyle
>>
>> On Thu, Nov 27, 2014 at 11:42 AM, Michael Eickenberg <
>> michael.eickenb...@gmail.com> wrote:
>>
>>> Dear Ian,
>>>
>>> I guess this comes from the assumption that these types of pairwise
>>> similarity matrices have been dense in many use cases. E.g. a Gaussian
>>> kernel matrix is completely dense. Many kernel methods also expect dense
>>> input. But it is true that this latter fact shouldn't necessarily be
>>> imposed on all the similarity measures if there is a possibility of
>>> sparse output...
>>>
>>> Michael
>>>
>>> On Thu, Nov 27, 2014 at 5:26 PM, Ian Ozsvald <i...@ianozsvald.com> wrote:
>>>
>>>> Hey all. I'm working on distance measurement in a reasonably
>>>> high-dimensional but very sparse space (a 1.3M x 35k matrix). At this
>>>> size my 16GB laptop runs out of memory, so I've walked back through the
>>>> code and noticed something I don't understand.
>>>>
>>>> sklearn.metrics.pairwise_distances(X, metric='cosine') calls
>>>> pairwise.cosine_similarity, which takes sparse inputs and preserves
>>>> their sparsity until the final call:
>>>>
>>>> def cosine_similarity(X, Y):
>>>>     # both inputs are csr sparse from a DictVectorizer(..., sparse=True)
>>>>     X_normalized = normalize(...)  # sparse result
>>>>     Y_normalized = X_normalized    # both inputs are the same: still sparse
>>>>     K = linear_kernel(X_normalized, Y_normalized)
>>>>
>>>> linear_kernel in turn calls safe_sparse_dot(X, Y.T, dense_output=True),
>>>> and there the result is forced to be dense.
>>>>
>>>> If safe_sparse_dot is called with dense_output=False then I get a
>>>> sparse result and everything looks sensible with low RAM usage.
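The equivalent computation in plain SciPy illustrates the difference (toy sizes here; the real matrix is ~1.3M x 35k):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.RandomState(0)
X = sp.random(200, 50, density=0.02, format="csr", random_state=rng)

# L2-normalize the rows while staying sparse (what normalize(...) does).
sq_norms = np.asarray(X.multiply(X).sum(axis=1)).ravel()
norms = np.sqrt(sq_norms)
norms[norms == 0] = 1.0                # guard all-zero rows
X_norm = sp.diags(1.0 / norms) @ X     # still a sparse matrix

K_sparse = X_norm @ X_norm.T           # sparse cosine similarities, low RAM
K_dense = K_sparse.toarray()           # what dense_output=True forces
```

With very sparse inputs the sparse product only stores entries for sample pairs whose supports overlap, which is exactly where the RAM saving comes from.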
>>>>
>>>> I'm using 0.15; current GitHub master shows the line here:
>>>>
>>>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/pairwise.py#L692
>>>>
>>>> Was there a design decision to force dense matrices at this point?
>>>> Maybe some call paths assume a dense result?
>>>>
>>>> Ian.
>>>>
>>>> --
>>>> Ian Ozsvald (A.I. researcher)
>>>> i...@ianozsvald.com
>>>>
>>>> http://IanOzsvald.com
>>>> http://ModelInsight.io
>>>> http://MorConsulting.com
>>>> http://Annotate.IO
>>>> http://SocialTiesApp.com
>>>> http://TheScreencastingHandbook.com
>>>> http://FivePoundApp.com
>>>> http://twitter.com/IanOzsvald
>>>> http://ShowMeDo.com
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
>>>> from Actuate! Instantly Supercharge Your Business Reports and Dashboards
>>>> with Interactivity, Sharing, Native Excel Exports, App Integration &
>>>> more
>>>> Get technology previously reserved for billion-dollar corporations, FREE
>>>>
>>>> http://pubads.g.doubleclick.net/gampad/clk?id=157005751&iu=/4140/ostg.clktrk
>>>> _______________________________________________
>>>> Scikit-learn-general mailing list
>>>> Scikit-learn-general@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>
>>>>
>>>
>>>
>>
>