Re: Term Weights and Clustering

Dawid Weiss Thu, 24 Feb 2005 03:53:43 -0800


Hi Owen,

I'm from the Carrot2 project, so I feel called to the blackboard:

> One source for how to do this is the thesis of Stanislaw Osinski and others like it: http://www.dcs.shef.ac.uk/teaching/eproj/msc2004/abs/m3so.htm And the Carrot2 project which uses similar techniques. http://www.cs.put.poznan.pl/dweiss/carrot/

Staszek Osinski is the author of Lingo, the best clustering algorithm available in Carrot2 -- we still work together on that project... In other words, Carrot2 doesn't use 'similar' techniques. It uses _the_ techniques described in the above thesis (and in various other papers, see my Web page).

> My problem is simple: I need a fairly clear discussion on exactly how to generate the labels, and to assign documents to them. The thesis is quite good, but I'm not sure I can reduce it to practice in the 2-3 days I have to evaluate it! Lucene has made the TDM easy to calculate, but I basically don't know what to do next!

You can use Carrot2 directly for that. There are a few options. One is to feed your input collection directly to the clustering component (it will take a while, but should work) -- you need to write a custom input component, but that is very simple to do, and I'm sure that if you write to the Carrot2 mailing list somebody will be willing to help (like myself or Staszek ;).

Another option is to use Lucene to index your documents and set up Carrot2 to use Lucene (described somewhere on this list, see David Spencer's message).
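
If you would rather hand-roll the step that follows the TDM for a quick evaluation, the document assignment part can be sketched as below. This is only a minimal sketch, not Carrot2 code: it assumes the cluster labels have already been chosen and expressed as term-weight vectors over the same terms as the TDM (label discovery itself is not shown), and it assigns a document to a label when their cosine similarity reaches a threshold. The class name, method names and the 0.6 threshold are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class LabelAssignment {

    /** Cosine similarity between two term-weight vectors of equal length. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        // An all-zero vector has no direction; treat it as similar to nothing.
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /**
     * For each label vector, collects the indices of the documents whose
     * cosine similarity to that label is at least the threshold.
     * A document may end up under several labels, or under none.
     */
    static List<List<Integer>> assign(double[][] labels, double[][] docs, double threshold) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (double[] label : labels) {
            List<Integer> members = new ArrayList<>();
            for (int d = 0; d < docs.length; d++) {
                if (cosine(label, docs[d]) >= threshold) {
                    members.add(d);
                }
            }
            clusters.add(members);
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Hypothetical toy data: 3 terms, 2 labels, 3 documents (one row each).
        double[][] labels = { {1, 1, 0}, {0, 0, 1} };
        double[][] docs = { {2, 1, 0}, {0, 0, 3}, {1, 0, 1} };
        System.out.println(assign(labels, docs, 0.6)); // prints [[0], [1, 2]]
    }
}
```

Documents that match no label are simply left out here; in a real run you would probably want to collect them in a separate "other" group.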

> a quick way to get a demo on the air? For example, I don't seem to be able to ask Carrot2 to do a Google "site" search.

Yep, there is a problem with it. Please post a bug report to the Carrot2 Bugzilla. I'll investigate it when I have time.

> simply aim Carrot2 at my collection with a very general search and see what clusters it discovers. This may be a gross misuse of Carrot2's clustering anyway, so could easily be a blind alley.

It kind of is, because the Carrot2 clustering components work primarily with _short_, scarce information sources, such as snippets. We don't intend to work on large collections of raw documents... Having said that, 1200 documents isn't that many, and you should be able to get your clusters.
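
Since the clustering works best on snippet-sized input, one way to prepare longer documents is to cut each one down to a short excerpt before feeding it in. The following is a minimal sketch under that assumption; the class and method names are hypothetical, and taking the first N characters is just one simple choice of excerpt, not what Carrot2 itself does.

```java
public class Snippets {

    /**
     * Collapses runs of whitespace and cuts the text to at most maxChars
     * characters, cutting at the last space in range when there is one.
     */
    static String snippet(String text, int maxChars) {
        String flat = text.trim().replaceAll("\\s+", " ");
        if (flat.length() <= maxChars) {
            return flat;
        }
        int cut = flat.lastIndexOf(' ', maxChars);
        if (cut <= 0) {
            cut = maxChars; // no space to break on, so cut mid-word
        }
        return flat.substring(0, cut);
    }

    public static void main(String[] args) {
        System.out.println(snippet("alpha   beta\n gamma", 10)); // prints alpha beta
    }
}
```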

D.
