Generally the first step for document clustering is to compute all
non-trivial document-document similarities. A good way to do that is to
strip kill (stop) words out of all documents and then do a document-level
cross-occurrence. In database terms, if we think of documents as (docid,
term) pairs, this step consists of joining the document table to itself to
get document-document pairs for all documents that share terms. In detail,
starting with a term weight table and a document table:
- join the term weight table to the document table to get (docid, term, weight)
- optionally normalize the term weights per document by summing weights or
squared weights by docid and joining that sum back onto the weighted
document table.
- join the result to itself, dropping the terms and reducing on the
(docid1, docid2) pair to sum weights. This gives (docid1, docid2,
sum_of_weights, number_of_occurrences). The sum can be over weights or
squared weights, and accumulating the number of co-occurrences helps in
computing the average (see the sketch below).
From here there are a number of places to go, but what we have at this
point is essentially a sparse similarity matrix. If the documents were
normalized, then document similarity can be converted to distance
trivially.
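To make "trivially" concrete (my gloss, not part of the recipe above): if
the documents are unit-normalized and the similarity is the accumulated dot
product as in the sketch, the Euclidean distance between two documents is
just

def distance(similarity):
    # ||a - b||^2 = 2 - 2 * a.b when ||a|| = ||b|| = 1
    return math.sqrt(max(0.0, 2.0 - 2.0 * similarity))

or use 1 - similarity directly as a cosine distance.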
On Thu, May 28, 2009 at 8:28 AM, Grant Ingersoll <[email protected]> wrote:
> It sounds like a start. Can you open a JIRA and attach a patch? I still
> am not sure if Lucene is totally the way to go on it. I suppose eventually
> we need a way to put things in a common format like ARFF and then just have
> transformers to it from other formats. Come to think of it, maybe it makes
> sense to have a Tika ContentHandler that can output ARFF or whatever other
> format we want. This would make translating input docs dead simple.
>
> Then again, maybe a real Pipeline is the answer. I know Solr, etc. could
> benefit from one too, but that is a whole different ball of wax.
>
>
>
> On May 28, 2009, at 10:32 AM, Shashikant Kore wrote:
>
>> Hi Grant,
>>
>> I have the code to create a Lucene index from document text and then
>> generate document vectors from it. This is stand-alone code and not
>> MR. Is it something that interests you?
>>
>> --shashi
>>
>> On Thu, May 28, 2009 at 5:57 PM, Grant Ingersoll <[email protected]>
>> wrote:
>>
>>> I'm about to write some code to prepare docs for clustering and I know at
>>> least a few others on the list here have done the same. I was wondering if
>>> anyone is in the position to share their code and contribute to Mahout.
>>>
>>> As I see it, we need to be able to take in text and create the matrix of
>>> terms, where each cell is the TF/IDF (or some other weight, would be nice
>>> to be pluggable) and then normalize the vector (and, according to Ted, we
>>> should support using different norms). Seems like we also need the label
>>> stuff in place (https://issues.apache.org/jira/browse/MAHOUT-65) but I'm
>>> not sure on the state of that patch.
>>>
>>> As for the TF/IDF stuff, we sort of have it via the BayesTfIdfDriver, but
>>> it needs to be more generic. I realize we could use Lucene, but having a
>>> solution that scales w/ Lucene is going to take work, AIUI, whereas a M/R
>>> job seems more straightforward.
>>>
>>> I'd like to be able to get this stuff committed relatively soon and have
>>> the examples for other people. My shorter-term goal is I'm working on some
>>> demos using Wikipedia.
>>>
>>> Thanks,
>>> Grant
>>>
>>>
>>>
>>>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
--
Ted Dunning, CTO
DeepDyve
111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)