I don't think that the clustering stuff will do the tf-idf weighting or cosine norm.
From there, any of the clustering algorithms should be happy. (That is, what you said is just right.)

On Thu, May 28, 2009 at 12:21 PM, Grant Ingersoll <[email protected]> wrote:

> Isn't this what Mahout's clustering stuff will do? In other words, if I
> calculate the vector for each document (presumably removing stopwords) and
> normalize it, where each cell is the weight (presumably TF/IDF), and then
> put that into a matrix (keeping track of labels), I should then be able to
> just run any of Mahout's clustering jobs on that matrix using the
> appropriate DistanceMeasure implementation, right? Or am I missing
> something?
>
> On May 28, 2009, at 11:55 AM, Ted Dunning wrote:
>
>> Generally the first step for document clustering is to compute all
>> non-trivial document-document similarities. A good way to do that is to
>> strip out kill words from all documents and then do a document-level
>> cross-occurrence. In database terms, if we think of documents as
>> (docid, term) pairs, this step consists of joining this document table to
>> itself to get document-document pairs for all documents that share terms.
>> In detail, starting with a term weight table and a document table:
>>
>> - join term weights to the document table to get (docid, term, weight)
>>
>> - optionally normalize term weights per document by summing weights or
>> squared weights by docid and joining back to the weighted document table
>>
>> - join the result to itself, dropping terms and reducing on docid to sum
>> weights. This gives (docid1, docid2, sum_of_weights,
>> number_of_occurrences). The sum can be of weights or squared weights.
>> Accumulating the number of cooccurrences helps in computing the average.
>>
>> From here, there are a number of places to go, but the result we have
>> here is essentially a sparse similarity matrix. If you have document
>> normalization, then document similarity can be converted to distance
>> trivially.
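[Editor's note: the join-based recipe Ted describes above can be sketched in miniature. This is plain Python with made-up toy data and stand-in unit term weights, not Mahout code; multiplying the normalized weights so that the sum comes out as a cosine similarity is one of the weighting choices Ted mentions, not the only one.]

```python
from collections import defaultdict
from itertools import combinations

# Documents as docid -> term lists (toy data; in practice the weights
# would come from a real term weight table such as tf-idf values).
docs = {
    "d1": ["apache", "mahout", "clustering"],
    "d2": ["mahout", "clustering", "vectors"],
    "d3": ["lucene", "index", "vectors"],
}
term_weight = defaultdict(lambda: 1.0)  # stand-in for a real weight table

# Step 1: join term weights onto documents -> (docid, term, weight)
weighted = {d: {t: term_weight[t] for t in ts} for d, ts in docs.items()}

# Step 2 (optional): normalize per document by the root of summed
# squared weights, i.e. unit-length (cosine) normalization.
for d, tw in weighted.items():
    norm = sum(w * w for w in tw.values()) ** 0.5
    for t in tw:
        tw[t] /= norm

# Step 3: self-join on shared terms, dropping terms and reducing on the
# document pair -> (docid1, docid2, sum_of_weights, number_of_cooccurrences)
postings = defaultdict(list)          # term -> [(docid, weight)]
for d, tw in weighted.items():
    for t, w in tw.items():
        postings[t].append((d, w))

sims = defaultdict(lambda: [0.0, 0])  # (d1, d2) -> [weight sum, count]
for t, posting in postings.items():
    for (d1, w1), (d2, w2) in combinations(sorted(posting), 2):
        sims[(d1, d2)][0] += w1 * w2
        sims[(d1, d2)][1] += 1

# The result is a sparse similarity matrix: pairs of documents that share
# no terms (d1, d3 here) never appear. With unit-normalized documents the
# weight sum is a cosine similarity, so distance is simply 1 - similarity.
for (d1, d2), (sim, n) in sorted(sims.items()):
    print(d1, d2, round(sim, 3), n, round(1.0 - sim, 3))
```

Note how the sparsity falls out of the join for free: only document pairs that actually share a term are ever materialized, which is what makes this feasible at scale.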
>>
>> On Thu, May 28, 2009 at 8:28 AM, Grant Ingersoll <[email protected]> wrote:
>>
>>> It sounds like a start. Can you open a JIRA and attach a patch? I
>>> still am not sure if Lucene is totally the way to go on it. I suppose
>>> eventually we need a way to put things in a common format like ARFF and
>>> then just have transformers to it from other formats. Come to think of
>>> it, maybe it makes sense to have a Tika ContentHandler that can output
>>> ARFF or whatever other format we want. This would make translating
>>> input docs dead simple.
>>>
>>> Then again, maybe a real Pipeline is the answer. I know Solr, etc.
>>> could benefit from one too, but that is a whole different ball of wax.
>>>
>>> On May 28, 2009, at 10:32 AM, Shashikant Kore wrote:
>>>
>>>> Hi Grant,
>>>>
>>>> I have the code to create a Lucene index from document text and then
>>>> generate document vectors from it. This is stand-alone code, not M/R.
>>>> Is it something that interests you?
>>>>
>>>> --shashi
>>>>
>>>> On Thu, May 28, 2009 at 5:57 PM, Grant Ingersoll <[email protected]>
>>>> wrote:
>>>>
>>>>> I'm about to write some code to prepare docs for clustering, and I
>>>>> know at least a few others on the list here have done the same. I
>>>>> was wondering if anyone is in a position to share their code and
>>>>> contribute it to Mahout.
>>>>>
>>>>> As I see it, we need to be able to take in text and create the matrix
>>>>> of terms, where each cell is the TF/IDF weight (or some other weight;
>>>>> it would be nice for this to be pluggable), and then normalize the
>>>>> vector (and, according to Ted, we should support using different
>>>>> norms). Seems like we also need the label stuff in place
>>>>> (https://issues.apache.org/jira/browse/MAHOUT-65), but I'm not sure
>>>>> of the state of that patch.
>>>>>
>>>>> As for the TF/IDF stuff, we sort of have it via the BayesTfIdfDriver,
>>>>> but it needs to be more generic.
>>>>> I realize we could use Lucene, but having a solution that scales
>>>>> with Lucene is going to take work, as I understand it, whereas an
>>>>> M/R job seems more straightforward.
>>>>>
>>>>> I'd like to be able to get this stuff committed relatively soon and
>>>>> have the examples for other people. My shorter-term goal is some
>>>>> demos I'm working on using Wikipedia.
>>>>>
>>>>> Thanks,
>>>>> Grant

--
Ted Dunning, CTO
DeepDyve
111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)
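[Editor's note: the matrix-building step Grant describes — TF/IDF cells with a pluggable weight and a pluggable norm — might look roughly like the Python sketch below. The toy corpus, the stopword list, and the smoothed idf formula (1 + log(N/df), chosen so a term in every document still gets nonzero weight) are illustrative assumptions, not Mahout's actual implementation.]

```python
import math
from collections import Counter

# Toy corpus; in practice stopwords come from a real list and the text
# from tokenized documents.
corpus = {
    "doc1": "mahout does clustering of vectors".split(),
    "doc2": "lucene builds an index of vectors".split(),
}
stopwords = {"does", "of", "an", "builds"}

def tfidf_vectors(corpus, stopwords, p=2):
    """Build tf-idf document vectors with a pluggable p-norm.

    p=2 gives cosine (unit-length) normalization, p=1 gives L1,
    p=None skips normalization entirely.
    """
    docs = {d: [t for t in ts if t not in stopwords] for d, ts in corpus.items()}
    df = Counter(t for ts in docs.values() for t in set(ts))  # document frequency
    n = len(docs)
    out = {}
    for d, ts in docs.items():
        tf = Counter(ts)
        # tf * smoothed idf; the weighting function is the natural
        # extension point to make pluggable.
        vec = {t: c * (1.0 + math.log(n / df[t])) for t, c in tf.items()}
        if p is not None:
            norm = sum(abs(w) ** p for w in vec.values()) ** (1.0 / p)
            vec = {t: w / norm for t, w in vec.items()}
        out[d] = vec
    return out

vectors = tfidf_vectors(corpus, stopwords)
```

Each resulting sparse vector is one row of the term matrix; keeping the docid keys alongside the rows is exactly the label bookkeeping MAHOUT-65 is about.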
