I am planning to take a closer look at the Carrot2 implementation and expose
the other algorithms to the user,

That's actually quite simple -- I was planning to do it, but have no time at the moment. The current Carrot2 code in Nutch is a preconfigured process that uses the open source Lingo clustering algorithm to cluster documents. But the Carrot2 codebase now has a scriptable controller, so you could basically have external scripts configuring several different algorithms. It really isn't that difficult. If you need any help, let me know -- private e-mail or the newsgroup, whatever.

changes to the algorithm(s) so that speed wise be as good as vivisimo (not
only interface wise ;-)).

We don't know what Vivisimo's algorithm is really like in terms of speed. Its authors and co-founders are excellent researchers, so I guess it will be a tough beast to beat :) But of course we don't have any reason to be ashamed -- the open source version is quite decent. In the commercial version we refactored the codebase and added an optional native matrix computation library. The speedup is significant (though it only matters if your servers are really under a lot of load).

Dawid
