Btw... relative to the cost of decomposition, have you seen the recent spate of articles on stochastic decomposition? It can dramatically speed up LSA.
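As a rough illustration of the idea, here is a minimal sketch of a randomized range finder with one power step, written in plain Python so it runs without dependencies. The function names (`matmul`, `transpose`, `orthonormalize`, `project_with_power_step`) and the `k` and `seed` parameters are made up for this example and are not taken from any particular library. It also assumes a small dense matrix stored as a list of rows; a real LSA setup would use sparse matrices and a numerical library.

```python
import random


def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a dense matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def orthonormalize(y):
    """Modified Gram-Schmidt on the columns of y; returns orthonormal columns."""
    basis = []
    for col in zip(*y):
        v = list(col)
        for q in basis:
            dot = sum(vi * qi for vi, qi in zip(v, q))
            v = [vi - dot * qi for vi, qi in zip(v, q)]
        norm = sum(vi * vi for vi in v) ** 0.5
        if norm > 1e-12:  # skip directions that are numerically dependent
            basis.append([vi / norm for vi in v])
    return transpose(basis)


def project_with_power_step(a, k, seed=0):
    """Hypothetical helper: approximate the range of a (n x m) with k columns.

    Returns (q, b) where q is n x k with orthonormal columns and
    b = q^T a is the k x m reduced representation, so that q b ~ a.
    """
    rng = random.Random(seed)
    m = len(a[0])
    # Gaussian test matrix (m x k): the random projection.
    omega = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(m)]
    y = matmul(a, omega)                    # sample the range of a
    y = matmul(a, matmul(transpose(a), y))  # one power step: (a a^T) a omega
    q = orthonormalize(y)
    b = matmul(transpose(q), a)
    return q, b
```

Here `b` gives a k-dimensional description of each column of `a`, and the product of `q` and `b` approximates `a`. Taking an exact SVD of the small matrix `b` afterwards would recover approximate singular vectors if they are needed.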
See http://arxiv.org/abs/0909.4061v1 for a good survey. My guess is that you don't even need to do the SVD and could just use a random projection with a single power step (which is nearly equivalent to random indexing).

On Mon, Jan 4, 2010 at 11:57 AM, Dawid Weiss <[email protected]> wrote:

> We agree, it was just me explaining things vaguely. The bottom line
> is: a lot depends on what you're planning to do with the clusters and
> the methodology should be suitable to this.
>
> Dawid
>
>
> On Mon, Jan 4, 2010 at 8:53 PM, Ted Dunning <[email protected]> wrote:
> > I think I agree with this for clusters that are intended for human
> > consumption, but I am sure that I disagree with this if you are looking to
> > use the clusters internally for machine learning purposes.
> >
> > The basic idea for the latter is that the distances to a bunch of clusters
> > can be used as a description of a point. This description in terms of
> > distances to cluster centroids can make some machine learning tasks vastly
> > easier.
> >
> > On Mon, Jan 4, 2010 at 11:44 AM, Dawid Weiss <[email protected]> wrote:
> >
> >> What's worse -- neither method is "better". We at Carrot2 have a
> >> strong feeling that clusters should be described properly in order to
> >> be useful, but one may argue that in many, many applications of
> >> clustering, the labels are _not_ important and just individual
> >> features of clusters (like keywords or even documents themselves) are
> >> enough.
> >>
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
>

--
Ted Dunning, CTO
DeepDyve
