On Fri, Jan 1, 2010 at 3:24 PM, Grant Ingersoll <[email protected]> wrote:
> On Jan 1, 2010, at 5:00 AM, Ted Dunning wrote:
>
> > On Thu, Dec 31, 2009 at 10:41 PM, Bogdan Vatkov
> > <[email protected]> wrote:
> >
> > > I would like to give some feedback. And ask some questions as well :).
> >
> > Thank you! Very helpful feedback.
> >
> > > ... Carrot2 for 2 weeks ... has a great level of usability and
> > > simplicity, but ... I had to give up on it since my very first
> > > practical clustering task required clustering 23K+ documents.
> >
> > Not too surprising.
>
> Right, Carrot2 is designed for clustering search results, and of that
> mainly the title and snippet. While it can do larger docs, they are
> specifically not the target. Plus, C2 is an in-memory tool designed to
> be very fast for search results.
>
> > > ...
> > > I have managed to do some clustering on my 23,000+ docs with
> > > Mahout/k-means in something like 10 min (in standalone mode - no
> > > parallel processing at all, I didn't even use all of my (3 :-) )
> > > cores yet with Hadoop/Mahout), but I am still learning and still
> > > trying to analyze whether the resulting clusters are really
> > > meaningful for my docs.
> >
> > I have seen this effect before, where a map-reduce program run
> > sequentially is much faster than an all-in-memory implementation.
> >
> > > One thing I can tell already now is that I definitely, desperately,
> > > need stop-word removal.
> >
> > You should be able to do this in the document -> vector conversion.
> > You could also do this at the vector level by multiplying the
> > coordinates of all stop words by zero, but that is not as nice a
> > solution.
>
> Right, or if you are using the Lucene extraction method, at Lucene
> indexing time.

Ok, so it seems I have to use the stop-word feature of Lucene itself,
right? I just saw there is something about stop words in Lucene, but I
have yet to find out how to use that capability.

> > > ... But it would be valuable for me to be able to come back later to
> > > the complete context of a document (i.e. with the stopwords inside) -
> > > maybe it is a question on its own - how can I easily go back from
> > > clusters -> original docs (and not just vectors)? I do not know,
> > > maybe some kind of mapper which maps vectors to the original
> > > documents somehow (e.g. a sort of URL for a document based on the
> > > vector id/index or something?).
> >
> > To do this, you should use the document ID and just return the
> > original content from some other content store. Lucene or especially
> > Solr can help with this.
>
> Right, Mahout's vectors can take labels.

> > > ...
> > > I think I will get better results if I can also apply stemming. What
> > > would be your recommendation when using Mahout? Should I do the
> > > stemming again somewhere in the input vector forming?
> >
> > Yes. That is exactly correct.
>
> Again, really easy to do if you use the Lucene method for creating
> vectors.

Do you mean I have to apply stemming during the vector creation or
already at Lucene indexing time? Maybe from a clustering POV it is the
same, but what would you recommend?
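For illustration, a minimal sketch of how stop-word removal and stemming
could both be wired in at Lucene analysis time, before the terms reach
Mahout's vector creation. This assumes the Lucene 2.9/3.0-era analysis API
(StandardTokenizer, StopFilter, PorterStemFilter); exact constructors vary
between Lucene versions, and the class name is invented for the example:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Illustrative analyzer: tokenize, lowercase, drop stop words, Porter-stem.
public class StemmedStopAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new StandardTokenizer(Version.LUCENE_CURRENT, reader);
    stream = new LowerCaseFilter(stream);
    // Lucene's built-in English stop word set; a domain-specific set
    // (or extra stop words) could be passed here instead.
    stream = new StopFilter(true, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    stream = new PorterStemFilter(stream);
    return stream;
  }
}

Whether such an analyzer runs at indexing time or in a separate
document-to-vector step, the resulting terms are the same, so the choice is
mostly about where the work happens.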
> > > It is also really essential for me to have "updateable" algorithms,
> > > as I am adding new documents on a daily basis, and I definitely would
> > > like to have them clustered immediately (incrementally) - I do not
> > > know if this is what is called "classification" in Mahout, and I did
> > > not reach those examples yet (I wanted to really get acquainted with
> > > the clustering first).
> >
> > I can't comment on exactly how this should be done, but we definitely
> > need to support this use case.
>
> Don't people usually see if the new docs fit into an existing cluster
> and, if they are a good fit, add them there; otherwise, maybe put them in
> the best match and kick off a new job.

Actually this question goes back to the original attempt - to analyze
documents automatically, by the machine and not by people :). One of my
goals is to not read the new document but rather have the system tell me
whether I should read it ;) - e.g. if it gets clustered/classified against
a given cluster/topic which I am interested (or not interested) in, I could
then take a more informed decision whether to read it (or not).

> > > And that is not all - I do not only want to have new documents
> > > clustered against existing clusters; what I want in addition is that
> > > clusters could actually change with new docs coming.
> >
> > Exactly. This is easy algorithmically with k-means. It just needs to
> > be supported by the software.
>
> Makes sense and shouldn't be that hard to do. I'd imagine we just need to
> be able to use the centroids from the previous run as the seeds for the
> new run.

> > > Of course one could not observe new clusters popping up after a
> > > single new doc is added to the analysis, but clusters should really
> > > be adaptable/updateable with new docs.
> >
> > Yes. It is eminently doable. Occasionally you should run back through
> > all of the document vectors so you can look at old documents in light
> > of new data, but that should be very, very fast in your case.

I do not know how this updatable clustering works (using previous results
as centroids for new clusterings) - is there an example I could see in
action? Additionally, I would like to see an example of how one could
combine Canopy and k-means; I have only seen this described in theory and
could not find an example of it.

Best regards,
Bogdan
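To make the "previous centroids as seeds" idea concrete, here is a small
self-contained sketch in plain Java. It is only an illustration of the
seeding step, not Mahout's KMeansDriver API: the centroids produced by one
run are handed back in as the starting clusters for the next run over the
old plus new documents.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy illustration of "updateable" k-means: centroids from the previous
// run seed the next run over old + new documents. Plain in-memory Java,
// not Mahout's driver API.
public class IncrementalKMeansSketch {

  // One k-means run: repeat assign/update for a fixed number of iterations.
  static double[][] kmeans(List<double[]> points, double[][] seeds, int iterations) {
    double[][] centroids = seeds.clone();
    for (int iter = 0; iter < iterations; iter++) {
      double[][] sums = new double[centroids.length][centroids[0].length];
      int[] counts = new int[centroids.length];
      for (double[] p : points) {                  // assignment step
        int best = nearest(centroids, p);
        counts[best]++;
        for (int d = 0; d < p.length; d++) {
          sums[best][d] += p[d];
        }
      }
      for (int c = 0; c < centroids.length; c++) { // update step
        if (counts[c] > 0) {
          for (int d = 0; d < sums[c].length; d++) {
            sums[c][d] /= counts[c];
          }
          centroids[c] = sums[c];
        }
      }
    }
    return centroids;
  }

  // Index of the centroid closest (squared Euclidean distance) to p.
  static int nearest(double[][] centroids, double[] p) {
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      double dist = 0.0;
      for (int d = 0; d < p.length; d++) {
        double diff = centroids[c][d] - p[d];
        dist += diff * diff;
      }
      if (dist < bestDist) {
        bestDist = dist;
        best = c;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    List<double[]> docs = new ArrayList<double[]>(Arrays.asList(
        new double[] {1.0, 0.0}, new double[] {1.1, 0.1}, new double[] {0.0, 1.0}));

    // Initial run, seeded with arbitrary guesses (e.g. the first k docs).
    double[][] centroids = kmeans(docs, new double[][] {{1.0, 0.0}, {0.0, 1.0}}, 10);

    // Later: new docs arrive; re-run with the OLD centroids as the seeds.
    docs.add(new double[] {0.1, 1.2});
    centroids = kmeans(docs, centroids, 10);
    System.out.println(Arrays.deepToString(centroids));
  }
}

The Canopy + k-means combination asked about above works the same way at
this step: a cheap canopy pass over the vectors produces the canopy
centers, and those centers (instead of random picks or previous centroids)
are used to seed k-means.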
