On Jul 7, 2009, at 6:10 PM, Ted Dunning wrote:
Paul,
You are doing a fine job so far. Typically, I would recommend you stick with the Document x Term matrix formulation. That form would allow you to use k-means almost directly.
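At toy scale the idea is roughly this (a pure Python/numpy sketch, not the Mahout code; the documents, the value of k and the initialisation are all made up for illustration):

import numpy as np

# Toy Document x Term matrix: one row per document, one column of term counts per term.
doc_term = np.array([
    [2.0, 1.0, 0.0, 0.0],   # Da
    [1.0, 1.0, 0.0, 0.0],   # Db
    [0.0, 0.0, 2.0, 1.0],   # Dc
    [0.0, 0.0, 1.0, 2.0],   # Dd
])

k = 2
centroids = doc_term[:k].copy()   # naive initialisation, fine for a sketch

for _ in range(10):
    # assign each document (row) to its nearest centroid
    dists = ((doc_term[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # recompute each centroid as the mean of the documents assigned to it
    for j in range(k):
        if (labels == j).any():
            centroids[j] = doc_term[labels == j].mean(axis=0)

print(labels)   # [0 0 1 1]: documents with near-identical terms end up together

k-means never needs anything beyond the rows themselves, which is what makes the Document x Term form so convenient.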
The good news for you is that Grant is working on an extended example to do just what you are looking to do. I expect he will comment shortly.
On vacation... I will try to find some time soon.
On Tue, Jul 7, 2009 at 2:37 PM, Paul Ingles <[email protected]> wrote:
Hi,
I've been doing some reading through the archives to search for some inspiration with a problem I've been attempting to solve at work, and was hoping I could share where my head's at and get some pointers for where to go. Since we're looking at clustering between 17m and 70m documents, we're looking to implement this in Hadoop.
We're trying to build clusters of (relatively small) documents, based on the words/terms they contain. Clusters will probably range in size between 10 and 1000 documents. Clusters should ultimately be populated by documents that contain almost identical terms (nothing clever like stemming done so far).
So far, I've been working down the pairwise similarity route. So, using a few MapReduce jobs we produce something along the lines of the following:
Da: [Db,0.1] [Dc,0.4]
Db: [Dc,0.5] [Df,0.9]
...
Dj:
Each row is a document, followed by a vector of (related document, similarity score) tuples. Viewed another way, it's the typical matrix:
     |   A   |   B   |   C
-----+-------+-------+-------
  A  |       |  0.1  |  0.4
  B  |       |       |  0.5

etc. A higher number means a more closely related document.
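(For concreteness, the jobs boil down to something like the following toy, single-machine Python; cosine similarity over raw term counts here, and the documents and scores are made up:)

import math
from collections import Counter
from itertools import combinations

docs = {
    "Da": "alpha beta gamma gamma",
    "Db": "beta gamma delta",
    "Dc": "gamma delta epsilon",
}

def term_vector(text):
    # raw term counts, no stemming or weighting
    return Counter(text.split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in set(u) & set(v))
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = {doc_id: term_vector(text) for doc_id, text in docs.items()}

# one row per document: tuples of (related document, similarity score)
rows = {doc_id: [] for doc_id in docs}
for a, b in combinations(sorted(docs), 2):
    score = cosine(vectors[a], vectors[b])
    if score > 0:
        rows[a].append((b, round(score, 2)))

for doc_id, related in rows.items():
    print(doc_id + ":", " ".join("[%s,%s]" % pair for pair in related))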
I've been trying to get my head around how to cluster these in a set of MapReduce jobs and I'm not quite sure how to proceed: the examples I've read around k-means, canopy clustering, etc. all seem to work on multidimensional (numerical) data. Given the data above, is it even possible to adapt the algorithm? The potential centroids in the example above would be the documents themselves, and I just can't get my head around applying the algorithms to this kind of data.
I guess the alternative would be to step back to producing a matrix of terms x documents:

     |  Ta   |  Tb   |  Tc
-----+-------+-------+-------
  Da |       |  0.1  |  0.4
  Db |       |       |  0.5

And then cluster based on this? This seems similar in structure to the user x movie recommendation matrix that's often used?
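(Building that matrix itself seems straightforward; as a toy Python sketch with made-up documents and raw term counts, since we don't do any stemming or weighting yet:)

from collections import Counter

docs = {
    "Da": "alpha beta beta gamma",
    "Db": "beta gamma delta",
}

# sparse documents x terms matrix: {doc_id: {term: count}}
# same row/column shape as a user x movie ratings matrix
# (documents as "users", terms as "movies", counts as "ratings")
doc_term = {doc_id: dict(Counter(text.split())) for doc_id, text in docs.items()}

for doc_id, terms in doc_term.items():
    print(doc_id, terms)
# Da {'alpha': 1, 'beta': 2, 'gamma': 1}
# Db {'beta': 1, 'gamma': 1, 'delta': 1}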
I'd really appreciate people's thoughts: am I thinking the right way? Is there a better way?
Thanks,
Paul
--
Ted Dunning, CTO
DeepDyve
111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search