Olivier Grisel wrote:
<snip>
> +1 for the dense case
>
> But ball tree does not work for high dim sparse data.
>
I'm working on that - I hope to have a pull request within the next few weeks.

> We would also need some truncated kernels (e.g. cosine similarity for
> positive data or RBF in the general case), probably implemented in
> Cython, for the high dim sparse case where the dense output of shape
> (n_samples, n_neighbors) is preallocated in advance (and assumed to
> fit in memory, while a dense array for (n_samples, n_samples) or
> (n_samples, n_features) would not).
>
> That would be very useful to make SpectralClustering work on text
> data. That should also help with the "over-convergence" issues I
> observe on the power iteration clustering branch when n_samples is too
> big.
>
> Using LSH (or some variant of random projection) might indeed be
> interesting to quickly build the approximate nearest neighbors graph of
> high dim sparse data (but I think a Cython version for the exact
> truncated case would still be useful, at least as a control reference
> for the approximate case).
>
> BTW, I am making some progress on the Random Projection branch: I have
> started integrating murmurhash to simulate random projection by a
> sparse matrix that is never materialized in memory. The example looks
> good too. It still needs some work on the hashing part and on the
> narrative doc.
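
To make the "exact truncated case" above concrete, here is a rough sketch in plain NumPy/SciPy rather than Cython. It is only an illustration of the idea, not code from any branch: the function name truncated_cosine_knn and the batch_size parameter are hypothetical, and it assumes n_neighbors < n_samples. The two (n_samples, n_neighbors) outputs are preallocated, and only one (batch_size, n_samples) block is ever dense at a time.

```python
import numpy as np
from scipy import sparse


def truncated_cosine_knn(X, n_neighbors, batch_size=256):
    """Exact top-k cosine neighbors of every row of a sparse matrix X.

    Hypothetical helper for illustration. Returns (indices, similarities),
    two dense arrays of shape (n_samples, n_neighbors), each row sorted by
    decreasing similarity. A sample is never its own neighbor.
    """
    X = sparse.csr_matrix(X, dtype=np.float64)
    n_samples = X.shape[0]

    # L2-normalize the rows so that dot products are cosine similarities.
    norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
    norms[norms == 0.0] = 1.0  # leave all-zero rows untouched
    Xn = (sparse.diags(1.0 / norms) @ X).tocsr()
    XnT = Xn.T.tocsr()

    # Dense outputs of shape (n_samples, n_neighbors), allocated once.
    indices = np.empty((n_samples, n_neighbors), dtype=np.intp)
    sims = np.empty((n_samples, n_neighbors), dtype=np.float64)

    for start in range(0, n_samples, batch_size):
        stop = min(start + batch_size, n_samples)
        # Only this (stop - start, n_samples) block is materialized densely.
        block = (Xn[start:stop] @ XnT).toarray()
        # Mask out the self-similarity of each row in the batch.
        block[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        # Unordered top-k per row, then sort just those k entries.
        top = np.argpartition(-block, n_neighbors - 1, axis=1)[:, :n_neighbors]
        top_sims = np.take_along_axis(block, top, axis=1)
        order = np.argsort(-top_sims, axis=1)
        indices[start:stop] = np.take_along_axis(top, order, axis=1)
        sims[start:stop] = np.take_along_axis(top_sims, order, axis=1)
    return indices, sims
```

The peak extra memory is batch_size * n_samples floats, so batch_size is the knob to turn when n_samples is large. A Cython version could avoid even that intermediate block by accumulating the top k per row directly.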
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
