I don't think that the clustering stuff will do the tf-idf weighting or cosine norm.
From there, any of the clustering algorithms should be happy. (That is, what you said is just right.)

On Thu, May 28, 2009 at 12:21 PM, Grant Ingersoll <[email protected]> wrote:

> Isn't this what Mahout's clustering stuff will do? In other words, if I
> calculate the vector for each document (presumably removing stopwords) and
> normalize it, where each cell is the weight (presumably TF/IDF), and then
> put that into a matrix (keeping track of labels), I should then be able to
> just run any of Mahout's clustering jobs on that matrix using the
> appropriate DistanceMeasure implementation, right? Or am I missing
> something?
>
> On May 28, 2009, at 11:55 AM, Ted Dunning wrote:
>
>> Generally the first step for document clustering is to compute all
>> non-trivial document-document similarities. A good way to do that is to
>> strip out kill words from all documents and then do a document-level
>> cross-occurrence. In database terms, if we think of documents as
>> (docid, term) pairs, this step consists of joining this document table to
>> itself to get document-document pairs for all documents that share terms.
>> In detail, starting with a term weight table and a document table:
>>
>> - join term weights to the document table to get (docid, term, weight)
>>
>> - optionally normalize term weights per document by summing weights or
>> squared weights by docid and joining back to the weighted document table
>>
>> - join the result to itself, dropping terms and reducing on docid to sum
>> weights. This gives (docid1, docid2, sum_of_weights,
>> number_of_occurrences). The sum can be of weights or squared weights.
>> Accumulating the number of cooccurrences helps in computing the average.
>>
>> From here, there are a number of places to go, but the result we have
>> here is essentially a sparse similarity matrix. If you have document
>> normalization, then document similarity can be converted to distance
>> trivially.
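[Editor's note: the join-based recipe Ted describes above can be sketched in miniature. This is plain Python with made-up toy data and stand-in unit term weights, not Mahout code; multiplying the normalized weights so that the sum comes out as a cosine similarity is one of the weighting choices Ted mentions, not the only one.]

```python
from collections import defaultdict
from itertools import combinations

# Documents as docid -> term lists (toy data; in practice the weights
# would come from a real term weight table such as tf-idf values).
docs = {
    "d1": ["apache", "mahout", "clustering"],
    "d2": ["mahout", "clustering", "vectors"],
    "d3": ["lucene", "index", "vectors"],
}
term_weight = defaultdict(lambda: 1.0)  # stand-in for a real weight table

# Step 1: join term weights onto documents -> (docid, term, weight)
weighted = {d: {t: term_weight[t] for t in ts} for d, ts in docs.items()}

# Step 2 (optional): normalize per document by the root of summed
# squared weights, i.e. unit-length (cosine) normalization.
for d, tw in weighted.items():
    norm = sum(w * w for w in tw.values()) ** 0.5
    for t in tw:
        tw[t] /= norm

# Step 3: self-join on shared terms, dropping terms and reducing on the
# document pair -> (docid1, docid2, sum_of_weights, number_of_cooccurrences)
postings = defaultdict(list)          # term -> [(docid, weight)]
for d, tw in weighted.items():
    for t, w in tw.items():
        postings[t].append((d, w))

sims = defaultdict(lambda: [0.0, 0])  # (d1, d2) -> [weight sum, count]
for t, posting in postings.items():
    for (d1, w1), (d2, w2) in combinations(sorted(posting), 2):
        sims[(d1, d2)][0] += w1 * w2
        sims[(d1, d2)][1] += 1

# The result is a sparse similarity matrix: pairs of documents that share
# no terms (d1, d3 here) never appear. With unit-normalized documents the
# weight sum is a cosine similarity, so distance is simply 1 - similarity.
for (d1, d2), (sim, n) in sorted(sims.items()):
    print(d1, d2, round(sim, 3), n, round(1.0 - sim, 3))
```

Note how the sparsity falls out of the join for free: only document pairs that actually share a term are ever materialized, which is what makes this feasible at scale.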
>>
>> On Thu, May 28, 2009 at 8:28 AM, Grant Ingersoll <[email protected]> wrote:
>>
>>> It sounds like a start. Can you open a JIRA and attach a patch? I
>>> still am not sure if Lucene is totally the way to go on it. I suppose
>>> eventually we need a way to put things in a common format like ARFF and
>>> then just have transformers to it from other formats. Come to think of
>>> it, maybe it makes sense to have a Tika ContentHandler that can output
>>> ARFF or whatever other format we want. This would make translating
>>> input docs dead simple.
>>>
>>> Then again, maybe a real Pipeline is the answer. I know Solr, etc.
>>> could benefit from one too, but that is a whole different ball of wax.
>>>
>>> On May 28, 2009, at 10:32 AM, Shashikant Kore wrote:
>>>
>>>> Hi Grant,
>>>>
>>>> I have the code to create a Lucene index from document text and then
>>>> generate document vectors from it. This is stand-alone code, not M/R.
>>>> Is it something that interests you?
>>>>
>>>> --shashi
>>>>
>>>> On Thu, May 28, 2009 at 5:57 PM, Grant Ingersoll <[email protected]>
>>>> wrote:
>>>>
>>>>> I'm about to write some code to prepare docs for clustering, and I
>>>>> know at least a few others on the list here have done the same. I
>>>>> was wondering if anyone is in a position to share their code and
>>>>> contribute it to Mahout.
>>>>>
>>>>> As I see it, we need to be able to take in text and create the matrix
>>>>> of terms, where each cell is the TF/IDF weight (or some other weight;
>>>>> it would be nice for this to be pluggable), and then normalize the
>>>>> vector (and, according to Ted, we should support using different
>>>>> norms). Seems like we also need the label stuff in place
>>>>> (https://issues.apache.org/jira/browse/MAHOUT-65), but I'm not sure
>>>>> of the state of that patch.
>>>>>
>>>>> As for the TF/IDF stuff, we sort of have it via the BayesTfIdfDriver,
>>>>> but it needs to be more generic.
>>>>> I realize we could use Lucene, but having a solution that scales
>>>>> with Lucene is going to take work, as I understand it, whereas an
>>>>> M/R job seems more straightforward.
>>>>>
>>>>> I'd like to be able to get this stuff committed relatively soon and
>>>>> have the examples for other people. My shorter-term goal is some
>>>>> demos I'm working on using Wikipedia.
>>>>>
>>>>> Thanks,
>>>>> Grant

--
Ted Dunning, CTO
DeepDyve
111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)
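[Editor's note: the matrix-building step Grant describes — TF/IDF cells with a pluggable weight and a pluggable norm — might look roughly like the Python sketch below. The toy corpus, the stopword list, and the smoothed idf formula (1 + log(N/df), chosen so a term in every document still gets nonzero weight) are illustrative assumptions, not Mahout's actual implementation.]

```python
import math
from collections import Counter

# Toy corpus; in practice stopwords come from a real list and the text
# from tokenized documents.
corpus = {
    "doc1": "mahout does clustering of vectors".split(),
    "doc2": "lucene builds an index of vectors".split(),
}
stopwords = {"does", "of", "an", "builds"}

def tfidf_vectors(corpus, stopwords, p=2):
    """Build tf-idf document vectors with a pluggable p-norm.

    p=2 gives cosine (unit-length) normalization, p=1 gives L1,
    p=None skips normalization entirely.
    """
    docs = {d: [t for t in ts if t not in stopwords] for d, ts in corpus.items()}
    df = Counter(t for ts in docs.values() for t in set(ts))  # document frequency
    n = len(docs)
    out = {}
    for d, ts in docs.items():
        tf = Counter(ts)
        # tf * smoothed idf; the weighting function is the natural
        # extension point to make pluggable.
        vec = {t: c * (1.0 + math.log(n / df[t])) for t, c in tf.items()}
        if p is not None:
            norm = sum(abs(w) ** p for w in vec.values()) ** (1.0 / p)
            vec = {t: w / norm for t, w in vec.items()}
        out[d] = vec
    return out

vectors = tfidf_vectors(corpus, stopwords)
```

Each resulting sparse vector is one row of the term matrix; keeping the docid keys alongside the rows is exactly the label bookkeeping MAHOUT-65 is about.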
