Good point. Didn't think about the fact that it was similarities that blew
up...
Kyle
On Thu, Nov 27, 2014 at 12:33 PM, Michael Eickenberg <
michael.eickenb...@gmail.com> wrote:
> this is more a problem of the sample axis size though: the similarity
> matrix will still be 1.3M x 1.3M, which should blow up most current memory
> sizes for all data types. Sparsity in the output can redeem this.
>
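To put rough numbers on the point above, here is a back-of-the-envelope sketch of the storage a 1.3M x 1.3M similarity matrix would need. The float64 dtype, the 32-bit CSR indices, and the figure of 100 nonzero similarities per row are illustrative assumptions, not measurements from Ian's data.

```python
# Rough memory estimate for an n x n similarity matrix.
n_samples = 1_300_000
bytes_per_float64 = 8

# Dense storage: every one of the n * n entries is stored.
dense_bytes = n_samples * n_samples * bytes_per_float64
print(f"dense float64: {dense_bytes / 1e12:.2f} TB")


def csr_bytes(n_rows, nnz, value_bytes=8, index_bytes=4):
    """Approximate CSR footprint: one value and one column index per
    nonzero, plus n_rows + 1 row pointers."""
    return nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes


# Hypothetical sparsity level, chosen only for illustration.
assumed_nnz_per_row = 100
sparse_bytes = csr_bytes(n_samples, n_samples * assumed_nnz_per_row)
print(f"CSR with {assumed_nnz_per_row} nonzeros per row: {sparse_bytes / 1e9:.2f} GB")
```

Under these assumptions the dense matrix needs about 13.5 TB, while the sparse one needs about 1.6 GB, so whether the output fits in RAM depends almost entirely on how many sample pairs have a nonzero similarity.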
> On Thursday, November 27, 2014, Kyle Kastner <kastnerk...@gmail.com>
> wrote:
>
>> On a side note, I am semi-surprised that allowing the output of the dot
>> to be sparse "just worked" without crashing the rest of it...
>>
>> On Thu, Nov 27, 2014 at 12:19 PM, Kyle Kastner <kastnerk...@gmail.com>
>> wrote:
>>
>>> If your data is really, really sparse in the original space, you might
>>> also look at taking a random projection (I think projecting on a sparse SVD
>>> basis would work too?) as preprocessing to "densify" the data before
>>> calling the cosine similarity. You might get a win on feature size with
>>> this, depending on the input size, at the cost of being an approximation
>>> rather than the real deal.
>>>
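A minimal sketch of the preprocessing idea above, using scikit-learn's TruncatedSVD as one possible projection (sklearn.random_projection.SparseRandomProjection would be another option). The matrix sizes, density, and n_components are made-up values for illustration; the random matrix is only a stand-in for real data.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Small random sparse matrix standing in for the real data (sizes are illustrative).
X = sp.random(500, 2000, density=0.01, format="csr", random_state=0)

# Project onto a low-dimensional basis; the result is a dense ndarray
# with far fewer columns than the original feature space.
svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X)

# Compare the approximate similarities with the exact ones on this small sample.
S_exact = cosine_similarity(X)
S_approx = cosine_similarity(X_reduced)
print(X_reduced.shape, S_approx.shape)
print("mean absolute error:", np.abs(S_exact - S_approx).mean())
```

This only shrinks the feature axis. The similarity matrix is still n_samples x n_samples, so on its own it does not address the size of the output.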
>>> It kind of looks like the code assumes sparse inputs will still result
>>> in a dense Gram matrix, which means the sample vectors are assumed not to
>>> be orthogonal to each other (is *almost* orthogonal good enough to get a
>>> mostly sparse result in this case, or must it be exact?), *if* I am
>>> thinking correctly.
>>>
>>> Kyle
>>>
>>> On Thu, Nov 27, 2014 at 11:42 AM, Michael Eickenberg <
>>> michael.eickenb...@gmail.com> wrote:
>>>
>>>> Dear Ian,
>>>>
>>>> I guess this comes from the assumption that these types of pairwise
>>>> similarity matrices have been dense in many use cases. E.g. a Gaussian
>>>> kernel matrix is completely dense. Many kernel methods also expect dense
>>>> input. But it is true that this latter fact shouldn't necessarily be
>>>> imposed on all the similarity measures if there is a possibility of sparse
>>>> output...
>>>>
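A small sketch of why the two kernels behave differently: for two samples that share no nonzero features, the dot product is exactly zero, while a Gaussian kernel value is always strictly positive. The helper names, the example vectors, and gamma=1.0 are hypothetical choices for illustration.

```python
import math


def linear_k(x, y):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def gaussian_k(x, y, gamma=1.0):
    """Gaussian (RBF) kernel exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Two samples with disjoint nonzero features.
x = [1.0, 0.0, 0.0, 0.0]
y = [0.0, 0.0, 2.0, 0.0]

k_lin = linear_k(x, y)    # exactly 0.0, so it need not be stored in a sparse matrix
k_rbf = gaussian_k(x, y)  # exp(-5), small but nonzero
print(k_lin, k_rbf)
```

So a Gaussian kernel matrix has no exact zeros to exploit, whereas a linear or cosine similarity matrix has a zero for every pair of samples with no features in common.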
>>>> Michael
>>>>
>>>> On Thu, Nov 27, 2014 at 5:26 PM, Ian Ozsvald <i...@ianozsvald.com>
>>>> wrote:
>>>>
>>>>> Hey all. I'm working on distance measurement in a reasonably
>>>>> high-dimensional but very sparse space (a 1.3 million x 35k matrix). At
>>>>> this size my 16GB laptop runs out of memory. I've walked back through the
>>>>> code and noticed something I don't understand.
>>>>>
>>>>> sklearn.metrics.pairwise_distances(..., metric='cosine') calls
>>>>> pairwise.cosine_similarity, which takes sparse inputs and preserves
>>>>> their sparsity until the final call:
>>>>>
>>>>>     def cosine_similarity(X, Y):
>>>>>         # both inputs are CSR sparse, from DictVectorizer(..., sparse=True)
>>>>>         X_normalized = normalize(...)  # sparse result
>>>>>         Y_normalized = X_normalized    # both inputs are the same, still sparse
>>>>>         K = linear_kernel(X_normalized, Y_normalized)
>>>>>         # -> linear_kernel(X_normalized, Y_normalized) calls
>>>>>         #    safe_sparse_dot(X, Y.T, dense_output=True)
>>>>>         #    and then the result is forced to be dense.
>>>>>
>>>>> If safe_sparse_dot is called with dense_output=False then I get a
>>>>> sparse result and everything looks sensible with low RAM usage.
>>>>>
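A minimal sketch of the workaround described above: normalize the rows, then take the product with dense_output=False so the result stays sparse. The random matrix and its sizes are placeholders for real data; normalize and safe_sparse_dot are the scikit-learn functions named in the trace.

```python
import scipy.sparse as sp
from sklearn.preprocessing import normalize
from sklearn.utils.extmath import safe_sparse_dot

# Small random sparse matrix standing in for the DictVectorizer output.
X = sp.random(1000, 5000, density=0.001, format="csr", random_state=0)

# L2-normalize each row; a sparse input gives a sparse result.
X_normalized = normalize(X)

# Cosine similarity as a sparse-sparse product, kept sparse on output.
S = safe_sparse_dot(X_normalized, X_normalized.T, dense_output=False)

print(sp.issparse(S), S.shape)
print("fraction of nonzero similarities:", S.nnz / (S.shape[0] * S.shape[1]))
```

Later scikit-learn releases also accept a dense_output argument on cosine_similarity itself; check the documentation for the version you have installed.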
>>>>> I'm using 0.15; the current GitHub master shows the line:
>>>>>
>>>>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/pairwise.py#L692
>>>>>
>>>>> Was there a design decision to force dense matrices at this point?
>>>>> Maybe some call paths assume a dense result?
>>>>>
>>>>> Ian.
>>>>>
>>>>> --
>>>>> Ian Ozsvald (A.I. researcher)
>>>>> i...@ianozsvald.com
>>>>>
>>>>> http://IanOzsvald.com
>>>>> http://ModelInsight.io
>>>>> http://MorConsulting.com
>>>>> http://Annotate.IO
>>>>> http://SocialTiesApp.com
>>>>> http://TheScreencastingHandbook.com
>>>>> http://FivePoundApp.com
>>>>> http://twitter.com/IanOzsvald
>>>>> http://ShowMeDo.com
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>> Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
>>>>> from Actuate! Instantly Supercharge Your Business Reports and
>>>>> Dashboards
>>>>> with Interactivity, Sharing, Native Excel Exports, App Integration &
>>>>> more
>>>>> Get technology previously reserved for billion-dollar corporations,
>>>>> FREE
>>>>>
>>>>> http://pubads.g.doubleclick.net/gampad/clk?id=157005751&iu=/4140/ostg.clktrk
>>>>> _______________________________________________
>>>>> Scikit-learn-general mailing list
>>>>> Scikit-learn-general@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>
>>>>>