Ted, how would just doing a random projection do the right thing? It's basically a metric-preserving technique, and one of the primary reasons to *do* LSA is to use a *different* metric (one in which "similar" terms are nearer to each other than they would otherwise be).
I've always thought that the primary point of that survey article was to demonstrate how to speed up LSA by turning the rank-K decomposition of a sparse N x M matrix into the rank-K decomposition of a dense (k+\delta) x (k+\delta) matrix (for \delta of order smaller than k) after doing a single K x N matrix multiplication (well, "single" in a parallel sense). You still want to do the decomposition, because it provides the proper weighting for your dimensions.

  -jake

On Mon, Jan 4, 2010 at 2:12 PM, Ted Dunning <[email protected]> wrote:
> Btw... relative to the cost of decomposition, have you seen the recent spate
> of articles on stochastic decomposition? It can dramatically speed up LSA.
>
> See http://arxiv.org/abs/0909.4061v1 for a good survey. My guess is that
> you don't even need to do the SVD and could just use a random projection
> with a single power step (which is nearly equivalent to random indexing).
>
> On Mon, Jan 4, 2010 at 11:57 AM, Dawid Weiss <[email protected]> wrote:
>
> > We agree, it was just me explaining things vaguely. The bottom line
> > is: a lot depends on what you're planning to do with the clusters and
> > the methodology should be suitable to this.
> >
> > Dawid
> >
> > On Mon, Jan 4, 2010 at 8:53 PM, Ted Dunning <[email protected]> wrote:
> > > I think I agree with this for clusters that are intended for human
> > > consumption, but I am sure that I disagree with this if you are looking
> > > to use the clusters internally for machine learning purposes.
> > >
> > > The basic idea for the latter is that the distances to a bunch of
> > > clusters can be used as a description of a point. This description in
> > > terms of distances to cluster centroids can make some machine learning
> > > tasks vastly easier.
> > >
> > > On Mon, Jan 4, 2010 at 11:44 AM, Dawid Weiss <[email protected]> wrote:
> > >
> > > > What's worse -- neither method is "better". We at Carrot2 have a
> > > > strong feeling that clusters should be described properly in order to
> > > > be useful, but one may argue that in many, many applications of
> > > > clustering, the labels are _not_ important and just individual
> > > > features of clusters (like keywords or even documents themselves) are
> > > > enough.
> > > >
> > >
> > > --
> > > Ted Dunning, CTO
> > > DeepDyve
>
> --
> Ted Dunning, CTO
> DeepDyve
>
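For anyone following along, here is a minimal numpy sketch of the randomized scheme the survey (arXiv:0909.4061) describes: project the matrix onto a random (k+\delta)-column test matrix in a single pass, optionally apply a power step, then run the SVD on the small dense matrix that results. The function name and parameters below are purely illustrative (this is not Mahout code or any library's API):

```python
import numpy as np

def randomized_svd(A, k, oversample=10, power_iters=1):
    """Sketch of a randomized rank-k SVD in the spirit of
    arXiv:0909.4061. Illustrative only, not a library API."""
    m, n = A.shape
    rng = np.random.default_rng(0)
    # Random test matrix Omega: n x (k + oversample).
    Omega = rng.standard_normal((n, k + oversample))
    # Single (parallelizable) multiplication by A: this is the
    # "random projection" step.
    Y = A @ Omega
    # Optional power step(s): sharpens the captured subspace,
    # roughly the "single power step" Ted mentions.
    for _ in range(power_iters):
        Y = A @ (A.T @ Y)
    # Orthonormal basis for the (approximate) range of A.
    Q, _ = np.linalg.qr(Y)
    # Small dense matrix: (k + oversample) x n.
    B = Q.T @ A
    # Cheap SVD of the small matrix supplies the singular-value
    # weighting for the dimensions -- the part you still want.
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], s[:k], Vt[:k, :]
```

The point of the final small SVD is exactly the weighting issue above: a plain random projection preserves the original metric, while keeping the decomposition rescales the axes by singular values.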
