I am planning to take a closer look at the Carrot2 implementation and expose
the other algorithms to the user,
That's actually quite simple -- I was planning to do it, but have no
time at the moment. The current Carrot2 code in Nutch is a preconfigured
process which uses the open source Lingo clustering algorithm to cluster
documents. But the Carrot2 codebase now has a scriptable
controller, so you could basically have external scripts configuring
several different algorithms. It really isn't that difficult. If you
need any help, let me know -- private e-mail or the newsgroup, whatever.
changes to the algorithm(s) so that speed-wise it is as good as Vivisimo (not
only interface-wise ;-)).
We don't know what the Vivisimo algorithm is really like in terms of speed.
Its authors and co-founders are excellent researchers, so I guess it
will be a tough beast to beat :) But of course we don't have any reason
to be ashamed -- the open source version is quite decent. In the
commercial version we refactored the codebase and added an optional
native matrix computation library. The speedup is significant (which
matters only if your servers are really under a lot of load).
Dawid