Apple Mail uses latent semantic analysis for clustering
That sounds right. Some people at Apple Research were looking at latent semantic analysis for document retrieval when I worked there in the mid-90s.
By the way, have you seen the work applying case-based reasoning to spam filtering? There are two articles on that at
http://www.cs.tcd.ie/publications/tech-reports/tr-index.04.html
with a bit more at the home page of one of the authors:
http://www.comp.dit.ie/sjdelany/
I've been thinking about whether there might be a benefit in making finer distinctions than just spam or not-spam, perhaps by clustering spam into topics. Why should the characteristics of porn spam, multilevel marketing spam, Nigerian 419 scams, etc., be lumped together? Would there be a benefit in making their differences explicit?
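
To make the question concrete, here is a rough sketch of one way it could be tried. It is not how Apple Mail or any existing filter works. It assumes the spam has already been hand-labeled by topic (rather than clustered automatically) and uses a plain multinomial naive Bayes classifier with add-one smoothing. The class name TopicBayes, the labels "ham", "spam-porn", "spam-mlm" and "spam-419", and the training messages are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


class TopicBayes:
    """Multinomial naive Bayes with one class per topic (add-one smoothing)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of messages
        self.vocab = set()                       # every word seen in training

    def train(self, label, text):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        """Return the label with the highest log-posterior for this text."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def is_spam(label):
    # The final spam / not-spam decision collapses all the spam topics.
    return label != "ham"


# Toy training set; the messages are invented for this example.
clf = TopicBayes()
clf.train("ham", "meeting agenda for the project review on monday")
clf.train("spam-porn", "hot xxx pics adult site click now")
clf.train("spam-mlm", "earn money from home join our marketing network")
clf.train("spam-419", "dear friend i am a barrister with funds to transfer from nigeria")

topic = clf.classify("transfer of funds from a barrister in nigeria")
print(topic, is_spam(topic))
```

Each spam topic keeps its own word statistics, and the topics are only merged at the last step in is_spam(). Whether that actually beats a single combined spam class is exactly the open question.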
-- sidney
