Re: [Nutch-dev] Search Results Clustering extension proposal (and sample implementation).

Dawid Weiss Mon, 02 Aug 2004 06:51:24 -0700

a) you're working with an older implementation of the clustering algorithm; the newer one should be faster (don't know whether it is going to be more accurate, but we hope so)
Is the new one available now? With the current implementation, there is a
noticeable waiting period between the request and the return of the
search/clustering result. What does "lingo-nmf-km-3" mean?

The current implementation is Lingo in its original form (based on SVD), but using the new "local interfaces" component binding architecture. Lingo-nmf-km-3 is an abbreviation for one of the other versions of Lingo, which uses non-negative matrix factorization (NMF). I have not included that component yet because it is still in the beta stage.

b) We don't use external ontologies or knowledge like Vivisimo or SkaneT. I think we should have a way of incorporating them somehow in the future.
Any suggestion?

Not really, just pointing out the fact. The suggestion is: we (meaning Carrot2) have to work on it. The constraint: time.

c) The user interface I proposed was a 30-minute draft just to show the capabilities of the plugin. I think the user interface should be thought over again and rewritten.
I am thinking that a "clustering on/off" link could be added
to the right of the Nutch "Search" button, with "clustering off" as the default.
Also, the "Next" button currently does not work as intended.

An on/off checkbox, right? A good idea. Now, what's wrong with the next button? It moves the display window to an incorrect position?
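
To make the on/off idea concrete, here is a minimal sketch of how such a switch might be read on the server side. The parameter name "clustering" and the class ClusteringToggle are hypothetical, made up for illustration; they are not part of Nutch or of the patch.

```java
import java.util.Map;

// Hypothetical helper: decides whether clustering was requested.
final class ClusteringToggle {
    // Assumed request parameter name; Nutch does not define it.
    static final String PARAM = "clustering";

    static boolean isEnabled(Map<String, String> params) {
        String value = params.get(PARAM);
        // A missing or unrecognized value falls back to the proposed default: off.
        return value != null
            && (value.equalsIgnoreCase("on") || value.equalsIgnoreCase("true"));
    }
}
```

With this shape, a request without the parameter is served as a plain search, which matches the "clustering off by default" suggestion.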

>>[snip]

design this, so if you could take a look at what I proposed and maybe try to fit it in with the distributed searchers then it would be awesome.
A direct hack is to add a new op code "OP_CLUSTER" (or similar) and the
supporting code to DistributedSearch.java and related classes. We will need
advice from Doug and other developers.
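
As a rough illustration of the op code idea only: the sketch below assumes a simple dispatch on a byte code. Every name in it (OpCodes, Handler, Dispatcher, OP_SEARCH) is hypothetical and does not mirror the actual contents of DistributedSearch.java.

```java
// All names are hypothetical; this does not mirror Nutch's DistributedSearch.java.
final class OpCodes {
    static final byte OP_SEARCH = 0;   // stand-in for an existing operation
    static final byte OP_CLUSTER = 1;  // the proposed new operation
}

// Handles one kind of request and returns a response payload.
interface Handler {
    String handle(String payload);
}

final class Dispatcher {
    private final Handler search;
    private final Handler cluster;

    Dispatcher(Handler search, Handler cluster) {
        this.search = search;
        this.cluster = cluster;
    }

    // Routes a request to the handler registered for its op code.
    String dispatch(byte op, String payload) {
        switch (op) {
            case OpCodes.OP_SEARCH:
                return search.handle(payload);
            case OpCodes.OP_CLUSTER:
                return cluster.handle(payload);
            default:
                throw new IllegalArgumentException("Unknown op code: " + op);
        }
    }
}
```

How the clustering handler would gather hits from the distributed searchers is left open here; that is the part that needs the advice mentioned above.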


Ok, I'm leaving it to you then.

Is there any comparison report on the clustering algorithms used in
Carrot2? What is used in your patch for Nutch, Lingo?

I included the original flavor of Lingo and the accompanying components it needs. In our experience, this algorithm was the most successful in most tests, controlled studies, etc. It suffers a speed penalty because of the use of SVD, though... The new implementation allows you to substitute matrix computations with a native library, which speeds things up considerably. There are other improvements, but as I said, for the moment we consider them to be in the beta stage and we'd rather not release them yet.
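
To show what "substituting matrix computations" could look like in principle, here is a minimal sketch of a pluggable backend with a pure-Java fallback. The names MatrixBackend, PureJavaBackend and Backends are hypothetical and are not the Carrot2 API; a native-library backend would simply be another implementation of the same interface.

```java
// Hypothetical abstraction over matrix computations; not the Carrot2 API.
interface MatrixBackend {
    double[][] multiply(double[][] a, double[][] b);
}

// Plain Java fallback: straightforward triple-loop multiplication.
final class PureJavaBackend implements MatrixBackend {
    public double[][] multiply(double[][] a, double[][] b) {
        int rows = a.length, inner = b.length, cols = b[0].length;
        double[][] c = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int k = 0; k < inner; k++) {
                for (int j = 0; j < cols; j++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
        return c;
    }
}

final class Backends {
    // Uses the supplied (e.g. native) backend if there is one, else the fallback.
    static MatrixBackend choose(MatrixBackend preferred) {
        return preferred != null ? preferred : new PureJavaBackend();
    }
}
```

The algorithm code would only talk to MatrixBackend, so swapping in a faster implementation would not touch the clustering logic itself.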

A comparison of algorithms is being prepared by my colleague Staszek Osinski. I'll let you know when it is available.

Dawid: in your patch, the use of tabs is not consistent. It would be great
if you could fix this along with the backslash one. More importantly, however,
a better (more complete) cluster.jsp is needed.

Always the same pain :) Does Nutch have a style file for automatic code formatting for any of the available tools out there? It speeds things up considerably. If not, I'll convert it manually -- the tabs are two spaces, right?

Dawid


_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
