Re: Document Clustering

Dawid Weiss Tue, 08 Feb 2005 01:48:55 -0800


Hi Owen,

> Last year it was suggested Carrot2 could help, and it would even produce good labels for the clusters. Has this proven to be true?

Yes, Carrot2 should help you with this. The labels it creates depend heavily on the quality of the input snippets, but the so-called KWIC (keyword in context) snippets should suffice (see David Spencer's example with Wikipedia).
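To make the idea of a keyword-in-context snippet concrete, here is a minimal sketch in Python. It is not how Carrot2 or David's example builds snippets; the function name kwic_snippets and its window and max_snippets parameters are made up for illustration, and the tokenization is deliberately naive.

```python
import re


def kwic_snippets(text, keyword, window=5, max_snippets=3):
    """Return up to max_snippets fragments of `window` words on each side of keyword.

    Hypothetical helper for illustration only; not part of Carrot2 or Lucene.
    """
    # Naive tokenization: runs of word characters, punctuation dropped.
    tokens = re.findall(r"\w+", text)
    needle = keyword.lower()
    snippets = []
    for i, token in enumerate(tokens):
        if token.lower() == needle:
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            snippets.append(" ".join(tokens[start:end]))
            if len(snippets) == max_snippets:
                break
    return snippets
```

Each returned fragment is a short piece of text centred on a query term, which is roughly the kind of input a snippet clusterer expects in place of the full document.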

There is one thing, though: what is employed in Carrot2 is an on-line unsupervised clusterer that is designed to work with a small number of documents and incomplete descriptions (snippets rather than full-text documents). It will _not_ work for large document collections (thousands of documents) simply because it was not designed to do that. I guess you could try with up to 500 snippets -- beyond that, you'll be waiting for the result forever.
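In practice this means capping the input before it reaches the clusterer. A minimal sketch, assuming the snippets arrive already ranked by the search engine so that keeping the first ones keeps the most relevant; cap_snippets and its default limit of 500 are illustrative names and values, not part of any Carrot2 API.

```python
from itertools import islice


def cap_snippets(ranked_snippets, limit=500):
    """Keep at most `limit` snippets from an iterable that is already ranked.

    Hypothetical helper: the default of 500 mirrors the rough upper bound
    mentioned above and should be tuned to your own hardware and patience.
    """
    if limit < 0:
        raise ValueError("limit must be non-negative")
    # islice works for lists and lazy iterators alike, and stops early.
    return list(islice(ranked_snippets, limit))
```

For example, cap_snippets(hits, limit=200) would hand only the top 200 hits to the clustering step.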

There are a great many algorithms that can cluster large document collections -- see, for example, the proceedings of information retrieval conferences.

As for David's hints:

> I'm not sure what the complexity of the algorithm is, but for me ~100
> docs works ok, maybe 200, but beyond 200 you need lots more CPU and RAM.

Yes, 100 to 200 snippets is optimal with the open source clustering algorithm. We have a refactored and optimized version of the Lingo clusterer that is commercial (it also provides hierarchical clustering capability as an add-on to the open source component). But even the commercial version will only cluster up to 500 to 1000 snippets. As I said, our goal was not to cluster document collections, but rather to retrieve useful information from preprocessed snippets.

Dawid

