Hi Shankar,

My problem is specifically about how to compute the similarity (I'm using
cosine similarity, but I guess it applies to other metrics as well) of
objects with widely varying numbers of nonzero features.

Although I'm not dealing with textual data, an easy analogy would be a
dataset of documents, the bulk of which are only a few sentences long,
while a few are very long. What would be a sound way to compare documents
across that range of sizes?
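
To make the concern concrete, here is a minimal sketch in plain Python. The
toy data is made up and assumes binary features; the helper name `cosine` and
the dict-based sparse representation are only for illustration. A short item
with 4 features is compared against longer items that contain all 4 of those
features plus extras. Under these assumptions the cosine similarity is
sqrt(4 / n), so it drops as the longer item grows, even though the short item
is fully contained in it.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {feature: value} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector when computing the dot product
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical toy data: a "short" item with 4 binary features.
short = {i: 1.0 for i in range(4)}

# "Long" items contain the same 4 features plus n - 4 additional ones.
sims = {}
for n in (4, 16, 64, 256):
    long_item = {i: 1.0 for i in range(n)}
    sims[n] = cosine(short, long_item)
    print(n, sims[n])
```

With real-valued or weighted features the exact numbers would differ, but the
same effect (the larger item's norm shrinking the score) is what makes
comparisons across very different sizes hard to interpret.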



On 23 April 2014 05:33, Shankar Satish <mailsh...@yahoo.co.in> wrote:

> Is your problem figuring out a good similarity measure, or dealing with
> large quantities of sparse data in a memory efficient way? If it is the
> latter, you can look into feature hashing:
> http://en.wikipedia.org/wiki/Feature_hashing
>
> regards
> shankar.
>
>
>
>
> On Wed, Apr 23, 2014 at 9:59 AM, Christian Jauvin <cjau...@gmail.com> wrote:
>
>> Hi,
>>
>> I want to compute the pairwise cosine similarity of items in a vector
>> space of very high dimensionality.
>>
>> My input matrix is very sparse, but the number of nonzero elements per
>> item follows a very skewed distribution (i.e. power law-ish, with very
>> few items having lots of features, and vice versa).
>>
>> Intuitively, comparing items with very different numbers of features
>> doesn't seem very desirable, but the only idea I have to mitigate this
>> problem is to partition my input matrix into "bands of items having
>> similar #s of features", which is not obvious to do, given the very
>> skewed distribution.
>>
>> I'd greatly appreciate any idea or suggestion about this problem.
>>
>> Thanks,
>>
>> Christian
>>
>>
>> ------------------------------------------------------------------------------
>> Start Your Social Network Today - Download eXo Platform
>> Build your Enterprise Intranet with eXo Platform Software
>> Java Based Open Source Intranet - Social, Extensible, Cloud Ready
>> Get Started Now And Turn Your Intranet Into A Collaboration Platform
>> http://p.sf.net/sfu/ExoPlatform
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>
>
>
