On 26.10.2012 14:27, Olivier Grisel wrote:
> 2012/10/26 Philipp Singer <kill...@gmail.com>:
>> Hey there!
>>
>> Currently I am working on very large sparse vectors and have to
>> calculate similarity between all pairs of them.
> How many features? Are they sparse? If so, which sparsity level?
In detail: I have a large co-occurrence matrix with a shape of around 3.7M x 3.7M. Yes, they are sparse; I can't tell you the exact sparsity level right now, but they should be very sparse, because in my case a single element only has co-occurrence counts with a small number of other elements. The "problem" is that I need cosine similarity, so I can't use the specific distance implementations available in numpy, scipy or scikit-learn; I just pass a callable function that does the job. (Currently I am using a completely separate implementation of my own, because calculating all-pairs similarity for my large data is just impossible at the moment.)

>> I have now looked into the available code in scikit-learn and also at
>> the corresponding literature. So I stumbled upon this paper [1] and
>> the corresponding implementation [2].
>>
>> I was now wondering whether this would be a potential improvement /
>> help for scikit-learn when working with very large feature files
>> where it is still necessary to calculate the pairwise similarity of
>> vectors for different classifiers or other tasks. The goal would be
>> to speed this whole thing up.
>>
>> I am by far no expert in this, but just wanted to ask you guys for
>> your opinion ;)
> Computing the sparse cosine similarity matrix of a large (both
> n_samples and n_features) dataset is really lacking in scikit-learn
> and I wanted to implement some tools to do this efficiently when
> working on my power iteration clustering pull request some time ago,
> but never found the time to do it.
>
> My idea was to use an in-memory inverted index structure, similar to
> a fulltext indexer such as Lucene, but using integer feature indices
> rather than string feature names / tokens.
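For what it's worth, a minimal sketch of the dense-free approach (this is just an illustration on toy data, not a solution to the scalability problem discussed here): if the matrix is stored as a scipy CSR matrix, you can L2-normalize the rows and then the cosine similarity matrix is simply the sparse product of the normalized matrix with its transpose. The data below is made up for the example; the real matrix would of course be far larger.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import normalize

# Toy stand-in for the co-occurrence matrix (hypothetical data; the
# real matrix discussed in this thread is ~3.7M x 3.7M and very sparse).
X = sp.csr_matrix(np.array([
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
    [1.0, 0.0, 2.0],
]))

# L2-normalize each row; then cosine similarity is just the sparse
# matrix product X_norm * X_norm.T, with no dense intermediate --
# feasible only as long as the result itself stays sparse.
X_norm = normalize(X, norm='l2', axis=1)
S = X_norm * X_norm.T  # sparse (n_samples x n_samples) similarity matrix

print(S.toarray())
```

Rows 0 and 2 are identical, so their similarity is 1; row 1 shares no features with them, so its off-diagonal entries are 0. The catch, as the paper [1] points out, is that for all-pairs similarity on millions of rows even this product is too expensive without pruning, which is exactly where the inverted-index idea comes in.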
> Such a data structure would also be interesting for sklearn.neighbors,
> to do efficient k-nearest neighbors multiclass or multilabel
> classification on high dimensional sparse data (which we don't address
> efficiently with the current BallTree data structure, which is optimal
> for fewer than 100 dense features).

That would be awesome, as I already had the impression that k-nearest neighbors is very slow for large data in scikit-learn; that is also the link to classification I made above, for which this would be helpful too.

>> [1] http://www.bayardo.org/ps/www2007.pdf
>> [2] http://code.google.com/p/google-all-pairs-similarity-search/
> Thanks for the links, added them to my reading list.

Perfect ;)

Regards,
Philipp