Hi Grant,
My replies are inline.
Grant Ingersoll wrote:
> I'm looking into adding document clustering capabilities to Solr,
> using Mahout [1][2]. I already have search-results clustering, thanks
> to Carrot2. What I'm looking for is practical advice on deploying a
> system that is going to cluster potentially large corpora (but not
> huge, and let's assume one machine for now, but it shouldn't matter).
> Here are some thoughts I have:
>
> In Solr, I expect to send a request to go off and build the clusters
> for some non-trivial set of documents in the index. The actual
> building needs to happen in a background thread, so as to not hold up
> the caller.
Bingo. It's better to spawn the clustering off as a separate job than to
hold up the caller. A status page reporting the state of this clustering
job would also help: the caller can then poll it to find out where the
job currently stands.
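
To make that concrete, here is a rough, untested sketch of what I have
in mind (the class and method names are made up for illustration): the
request handler hands the job to an executor and returns immediately,
and the caller polls the state.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;

public class ClusteringJobRunner {

  public enum State { IDLE, RUNNING, DONE, FAILED }

  private final ExecutorService executor = Executors.newSingleThreadExecutor();
  private final AtomicReference<State> state =
      new AtomicReference<State>(State.IDLE);

  // Returns immediately; the actual clustering runs in the background.
  public void start(final Runnable clusterJob) {
    state.set(State.RUNNING);
    executor.submit(new Runnable() {
      public void run() {
        try {
          clusterJob.run();
          state.set(State.DONE);
        } catch (Exception e) {
          state.set(State.FAILED);
        }
      }
    });
  }

  // What the status page would report back to the caller.
  public State getState() {
    return state.get();
  }
}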
> My thinking is the request will come in and spawn off a job that goes
> and calculates a similarity matrix for all the documents in the set
> (need to store the term vectors in Lucene) and then goes and runs the
> clustering job (user configurable, based on the implementations we
> have: k-means, mean-shift, fuzzy, whatever) and stores the results
> into Solr's data directory somehow (so that it can be replicated, but
> not a big concern of mine at the moment).
If we are going to work on a similarity matrix, I would also like to add
FIHC (Frequent Itemset-based Hierarchical Clustering); I can definitely
pitch in on that if you need. Ideally we should target replication too,
and I think the idea is good.
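
On the similarity matrix itself, the pairwise step is simple once the
term vectors are out of Lucene. A minimal sketch, assuming each document
has already been turned into a sparse term -> weight map (the Lucene
term-vector extraction is left out):

import java.util.List;
import java.util.Map;

public class SimilarityMatrixBuilder {

  // Cosine similarity between two sparse term -> weight vectors.
  static double cosine(Map<String, Double> a, Map<String, Double> b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (Map.Entry<String, Double> e : a.entrySet()) {
      Double w = b.get(e.getKey());
      if (w != null) {
        dot += e.getValue() * w;
      }
      normA += e.getValue() * e.getValue();
    }
    for (double w : b.values()) {
      normB += w * w;
    }
    return (normA == 0.0 || normB == 0.0)
        ? 0.0 : dot / Math.sqrt(normA * normB);
  }

  // The matrix is symmetric with 1.0 on the diagonal, so only the
  // upper triangle actually needs to be computed.
  static double[][] build(List<Map<String, Double>> docs) {
    int n = docs.size();
    double[][] sim = new double[n][n];
    for (int i = 0; i < n; i++) {
      sim[i][i] = 1.0;
      for (int j = i + 1; j < n; j++) {
        sim[i][j] = sim[j][i] = cosine(docs.get(i), docs.get(j));
      }
    }
    return sim;
  }
}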
> Then, at any time, the application can ask Solr for the clusters
> (whatever that means) and it will return them (docids, fields,
> whatever the app asks for). If the background task isn't done yet,
> the results set will be empty, or it will return a percentage
> completion or something useful.
In my opinion it is better to return the percentage of completion rather
than the top clusters at time X if clustering is not yet finished. In
most clustering algorithms the input data determines the centroids of
the clusters, so a change in the input might shift the centroids, and
you might get different results for different input samples drawn from
the same data set.
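
Reporting a percentage is cheap, though: the job just bumps a counter as
it works through the comparisons (or iterations), and the status request
derives the figure from it. A rough sketch, again with made-up names:

import java.util.concurrent.atomic.AtomicLong;

public class ClusterProgress {

  private final long totalSteps;  // e.g. n*(n-1)/2 pairwise comparisons
  private final AtomicLong done = new AtomicLong(0);

  public ClusterProgress(long totalSteps) {
    this.totalSteps = totalSteps;
  }

  // Called by the clustering job after each unit of work.
  public void step() {
    done.incrementAndGet();
  }

  // Percentage complete, 0-100; what the status page hands back.
  public int percentComplete() {
    if (totalSteps <= 0) {
      return 100;
    }
    return (int) (100.0 * done.get() / totalSteps);
  }
}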
> Obviously, my first step is to get it working, but...
>
> Is it practical to return a partially done set of results? i.e. the
> best clusters so far, with perhaps a percentage-to-completion value or
> perhaps a list of the comparisons that haven't been done yet?
>
> What if something happens? How can I make Mahout fault-tolerant, such
> that, conceivably, I could pick up the job again from where it went
> down, or at least be able to get the clusters so far? How do people
> approach this to date (w/ or w/o Mahout)? What needs to be done in
> Mahout to make this possible? I suspect Hadoop has some support for it.
I am not sure whether Mahout is fault tolerant in that respect, but I
guess other members can comment on this.
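
The generic approach would be to checkpoint intermediate state (e.g. the
centroids after every k-means iteration) somewhere durable, so that a
crashed job can resume from the last completed iteration instead of
starting over; if the Hadoop-based drivers already write per-iteration
output, that would be a natural restart point. A purely illustrative
sketch (this is not Mahout's actual API):

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class CentroidCheckpoint {

  // Persist the centroids after an iteration completes.
  static void save(double[][] centroids, int iteration, File dir)
      throws IOException {
    File f = new File(dir, "centroids-" + iteration + ".ser");
    ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(f));
    try {
      out.writeObject(centroids);
    } finally {
      out.close();
    }
  }

  // On restart, reload the newest checkpoint; null means start fresh.
  static double[][] loadLatest(File dir)
      throws IOException, ClassNotFoundException {
    File latest = null;
    int best = -1;
    File[] files = dir.listFiles();
    if (files == null) {
      return null;
    }
    for (File f : files) {
      String name = f.getName();
      if (name.startsWith("centroids-") && name.endsWith(".ser")) {
        int i = Integer.parseInt(name.substring(10, name.length() - 4));
        if (i > best) {
          best = i;
          latest = f;
        }
      }
    }
    if (latest == null) {
      return null;
    }
    ObjectInputStream in = new ObjectInputStream(new FileInputStream(latest));
    try {
      return (double[][]) in.readObject();
    } finally {
      in.close();
    }
  }
}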
> Anything else I don't know? Does what I'm thinking about make sense?
>
> Thanks for any insight,
> Grant
>
> [1] http://wiki.apache.org/solr/ClusteringComponent
> [2] https://issues.apache.org/jira/browse/SOLR-769
--
Thanks and Regards,
Vaijanath