This is more a problem of the sample-axis size, though: the similarity
matrix will still be 1.3M x 1.3M, which blows up. The sparse output redeemed
this.
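
To put rough numbers on why the dense version blows up, a quick
back-of-the-envelope sketch (assuming float64 and a csr result):

    # a dense float64 similarity matrix at this scale:
    n_samples = 1300000
    dense_bytes = n_samples ** 2 * 8
    print(dense_bytes / 1e12)   # ~13.5 TB -- hopeless on a 16GB laptop

    # the csr result only pays for the stored (nonzero) similarities, i.e.
    # K.data.nbytes + K.indices.nbytes + K.indptr.nbytes for the output K
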
On Thursday, November 27, 2014, Kyle Kastner <kastnerk...@gmail.com> wrote:
> On a side note, I am semi-surprised that allowing the output of the dot to
> be sparse "just worked" without crashing the rest of it...
>
> On Thu, Nov 27, 2014 at 12:19 PM, Kyle Kastner <kastnerk...@gmail.com> wrote:
>
>> If your data is really, really sparse in the original space, you might
>> also look at taking a random projection (I think projecting onto a sparse
>> SVD basis would work too?) as preprocessing to "densify" the data before
>> calling the cosine similarity. You might get a win on the feature size
>> with this, depending on the input size, at the cost of it being an
>> approximation rather than the real deal.
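>>
>> Roughly what I have in mind, as a sketch only -- TruncatedSVD is one way
>> to do the projection (sklearn.random_projection would be another), and the
>> toy shape and n_components below are made up:
>>
>>     from scipy import sparse
>>     from sklearn.decomposition import TruncatedSVD
>>     from sklearn.metrics.pairwise import cosine_similarity
>>
>>     # toy stand-in for the big, very sparse csr matrix
>>     X = sparse.rand(2000, 35000, density=0.001, format='csr')
>>
>>     # project onto a low-rank SVD basis: far fewer features, dense output
>>     svd = TruncatedSVD(n_components=100)
>>     X_reduced = svd.fit_transform(X)
>>
>>     # approximate cosine similarities in the reduced space; note the Gram
>>     # matrix itself is still n_samples x n_samples and dense
>>     K_approx = cosine_similarity(X_reduced)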
>>
>> It kind of looks like the code assumes sparse inputs will still result in
>> a dense Gram matrix, which means the rows are assumed not to be mutually
>> orthogonal (is *almost* orthogonal good enough to get a mostly sparse
>> result in this case, or must the dot products be exactly zero?), *if* I am
>> thinking correctly.
>>
>> Kyle
>>
>> On Thu, Nov 27, 2014 at 11:42 AM, Michael Eickenberg <michael.eickenb...@gmail.com> wrote:
>>
>>> Dear Ian,
>>>
>>> I guess this comes from the assumption that these types of pairwise
>>> similarity matrices are dense in many use cases. E.g. a Gaussian
>>> kernel matrix is completely dense. Many kernel methods also expect dense
>>> input. But it is true that this latter fact shouldn't necessarily be
>>> imposed on all the similarity measures if there is a possibility of
>>> sparse output...
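>>>
>>> (A tiny illustration, not from any docs: rows with disjoint support have
>>> cosine similarity exactly zero, while an RBF kernel entry
>>> exp(-gamma * ||x - y||^2) is always strictly positive, so that Gram
>>> matrix can never be sparse.)
>>>
>>>     import numpy as np
>>>     from scipy import sparse
>>>     from sklearn.metrics.pairwise import rbf_kernel, cosine_similarity
>>>
>>>     # two rows with disjoint support
>>>     X = sparse.csr_matrix(np.array([[1., 0., 0.], [0., 2., 0.]]))
>>>
>>>     print(cosine_similarity(X))  # off-diagonal entries are exactly 0.0
>>>     print(rbf_kernel(X))         # every entry > 0, so dense output is unavoidable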
>>>
>>> Michael
>>>
>>> On Thu, Nov 27, 2014 at 5:26 PM, Ian Ozsvald <i...@ianozsvald.com> wrote:
>>>
>>>> Hey all. I'm working on distance measurement in a reasonably
>>>> high-dimensional but very sparse space (a 1.3M x 35k matrix). At this
>>>> size my 16GB laptop runs out of memory, so I've walked back through the
>>>> code and noticed something I don't understand.
>>>>
>>>> sklearn.metrics.pairwise_distances(X, metric='cosine') ends up in
>>>> pairwise.cosine_similarity (via cosine_distances), which takes sparse
>>>> inputs and preserves their sparsity until the final call:
>>>>
>>>>     def cosine_similarity(X, Y):
>>>>         # both inputs are csr sparse from DictVectorizer(..., sparse=True)
>>>>         X_normalized = normalize(X, copy=True)  # sparse result
>>>>         Y_normalized = X_normalized             # both inputs are the same, still sparse
>>>>         K = linear_kernel(X_normalized, Y_normalized)
>>>>
>>>> linear_kernel then calls safe_sparse_dot(X, Y.T, dense_output=True),
>>>> and the result is forced to be dense.
>>>>
>>>> If safe_sparse_dot is called with dense_output=False, then I get a
>>>> sparse result and everything looks sensible, with low RAM usage.
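>>>>
>>>> For what it's worth, a minimal sketch of that path called directly (the
>>>> toy shape and density here are made up; normalize and safe_sparse_dot
>>>> are the same helpers the library code above uses):
>>>>
>>>>     from scipy import sparse
>>>>     from sklearn.preprocessing import normalize
>>>>     from sklearn.utils.extmath import safe_sparse_dot
>>>>
>>>>     # toy stand-in for the real 1.3M x 35k DictVectorizer output
>>>>     X = sparse.rand(2000, 35000, density=0.001, format='csr')
>>>>
>>>>     # L2-normalize the rows, then keep the dot product sparse
>>>>     X_normalized = normalize(X, copy=True)
>>>>     K = safe_sparse_dot(X_normalized, X_normalized.T, dense_output=False)
>>>>     print(repr(K))  # a sparse matrix of cosine similarities, not a 2000x2000 ndarray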
>>>>
>>>> I'm using 0.15; the current GitHub master shows the relevant line:
>>>>
>>>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/pairwise.py#L692
>>>>
>>>> Was there a design decision to force dense matrices at this point?
>>>> Maybe some call paths assume a dense result?
>>>>
>>>> Ian.
>>>>
>>>> --
>>>> Ian Ozsvald (A.I. researcher)
>>>> i...@ianozsvald.com
>>>>
>>>> http://IanOzsvald.com
>>>> http://ModelInsight.io
>>>> http://MorConsulting.com
>>>> http://Annotate.IO
>>>> http://SocialTiesApp.com
>>>> http://TheScreencastingHandbook.com
>>>> http://FivePoundApp.com
>>>> http://twitter.com/IanOzsvald
>>>> http://ShowMeDo.com
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>