Re: [Nutch-dev] Search Results Clustering extension proposal (and sample implementation).

john Sun, 01 Aug 2004 23:01:27 -0700

Dawid, Doug and all,

On Sat, Jul 31, 2004 at 10:29:19AM -0400, Dawid Weiss wrote:
> Hi John,
> 
> >It's great. I really like it. Tried on two repositories
> >(one mainly http urls and the other ftp urls, both are of a million urls),
> >the benefits are immediate. I think this stuff should be in nutch.
> 
> Glad you liked it. Be aware that:
> 
> a) you're working with an older implementation of the clustering 
> algorithm; the newer one should be faster (don't know whether it is 
> going to be more accurate, but we hope so)


Is the new one available now? With the current implementation, there is a
noticeable delay between the request and the return of the search/clustering
results. What does "lingo-nmf-km-3" mean?

> 
> b) We don't use external ontologies or knowledge like Vivisimo or 
> SkaneT. I think we should have a way of incorporating them somehow in 
> the future.

Any suggestions?

> 
> c) The user interface I proposed was a 30-minute draft just to show the 
> capabilities of the plugin. I think the user interface should be thought 
> over again and rewritten.

Maybe a "clustering on/off" link could be added to the right of the
Nutch "Search" button, with "clustering off" as the default.
Also, the "Next" button currently does not work as intended.

> 
> Anyway, I also think this could be a nice add-on to Nutch even in the 
> form it is available today.
> 
> As for technical difficulties, I thought about one thing today: Nutch 
> supports multiple (distributed) searchers, right? The extension I 
> proposed is used from the JSP page directly, which is probably not good 
> (the Web container serving JSPs will do all the clustering work). It 
> would make more sense to distribute the clusterer together with 
> searchers, or to distribute the clusterer even to a different set of 
> machines... I don't have enough knowledge of Nutch internals to actually 
> design this, so if you could take a look at what I proposed and maybe 
> try to fit it in with the distributed searchers then it would be awesome.

A direct hack is to add a new op code, "OP_CLUSTER" (or similar), plus the
supporting code, to DistributedSearch.java and related classes. We will need
advice from Doug and other developers.
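
To make the op code idea concrete, here is a minimal sketch. Everything in
it is an assumption for illustration: the class name OpClusterSketch, the op
code values, the wire format, and the first-word grouping that stands in for
a real clusterer. It is not the actual code in DistributedSearch.java; it
only shows the shape of a request/response dispatch with one extra op code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; op codes and wire format are assumptions, not Nutch's. */
public class OpClusterSketch {
    static final byte OP_SEARCH = 1;   // assumed value
    static final byte OP_CLUSTER = 2;  // proposed new op code, assumed value

    /** Server side: read one request from 'in', write one response to 'out'. */
    static void handle(DataInput in, DataOutput out) throws IOException {
        byte op = in.readByte();
        switch (op) {
            case OP_SEARCH: {
                out.writeUTF("hits for: " + in.readUTF());
                break;
            }
            case OP_CLUSTER: {
                int count = in.readInt();
                // Placeholder clustering: group snippets by their first word.
                // A real searcher would hand the snippets to the clusterer here.
                Map<String, Integer> sizes = new TreeMap<>();
                for (int i = 0; i < count; i++) {
                    String snippet = in.readUTF().trim();
                    String label = snippet.isEmpty()
                            ? "(empty)"
                            : snippet.split("\\s+")[0].toLowerCase(Locale.ROOT);
                    sizes.merge(label, 1, Integer::sum);
                }
                out.writeInt(sizes.size());
                for (Map.Entry<String, Integer> e : sizes.entrySet()) {
                    out.writeUTF(e.getKey());
                    out.writeInt(e.getValue());
                }
                break;
            }
            default:
                throw new IOException("Unknown op code: " + op);
        }
    }

    /** Client side: send an OP_CLUSTER request in memory and format the reply. */
    static String roundTrip(String... snippets) {
        try {
            ByteArrayOutputStream request = new ByteArrayOutputStream();
            DataOutputStream req = new DataOutputStream(request);
            req.writeByte(OP_CLUSTER);
            req.writeInt(snippets.length);
            for (String s : snippets) {
                req.writeUTF(s);
            }

            ByteArrayOutputStream response = new ByteArrayOutputStream();
            handle(new DataInputStream(new ByteArrayInputStream(request.toByteArray())),
                   new DataOutputStream(response));

            DataInputStream resp =
                    new DataInputStream(new ByteArrayInputStream(response.toByteArray()));
            StringBuilder sb = new StringBuilder();
            int clusters = resp.readInt();
            for (int i = 0; i < clusters; i++) {
                if (i > 0) sb.append(", ");
                sb.append(resp.readUTF()).append('=').append(resp.readInt());
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("nutch crawler", "Nutch search", "lucene index"));
    }
}
```

The point of the sketch is only that the clustering work moves behind the
same request/response channel as searching, so the Web container no longer
does it.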

> 
> >When applying the patch, there was a little problem for me,
> >who uses Linux as a development platform: the patch has file paths
> >using `\` as separator and it does not apply right away. But this is
> >a non-issue, I solved it by replacing them with forward ones
> >(is there a better way?).
> 
> Yeah, I could ask the same question -- is there a way to generate a 
> patch using windows' distro of unxutils and get those slashes right... I 
> apologize about it. Next time I'll just move the codebase to my Linux 
> machine and create a patch there.
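
For what it is worth, the replacement I did by hand can be scripted. Below is
a minimal sketch, assuming a unified diff whose file paths appear only on
header lines starting with "Index: ", "diff ", "--- " or "+++ ". The class
name PatchPathFixer is made up for this example. One known limitation: a
removed content line whose text itself begins with "-- " also shows up as
"--- ..." inside a hunk and would be rewritten too.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: turns backslashes into slashes on patch header lines only. */
public class PatchPathFixer {
    // Prefixes of diff header lines that carry file paths (an assumption
    // about the patch layout; hunk content lines are left untouched).
    private static final String[] HEADER_PREFIXES = {"Index: ", "diff ", "--- ", "+++ "};

    /** Returns the line with '\' replaced by '/' if it is a header line. */
    static String fixLine(String line) {
        for (String prefix : HEADER_PREFIXES) {
            if (line.startsWith(prefix)) {
                return line.replace('\\', '/');
            }
        }
        return line;
    }

    /** Applies fixLine to every line of a patch. */
    static List<String> fixLines(List<String> lines) {
        List<String> fixed = new ArrayList<>(lines.size());
        for (String line : lines) {
            fixed.add(fixLine(line));
        }
        return fixed;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PatchPathFixer <patch-file>");
            return;
        }
        // Print the fixed patch to stdout so it can be redirected to a new file.
        for (String line : fixLines(Files.readAllLines(Paths.get(args[0])))) {
            System.out.println(line);
        }
    }
}
```

Backslashes inside the changed code itself (string escapes, for instance) are
left alone, since only header lines are rewritten.
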
> 
> >Any document or note for us to jump start? 
> 
> Tell me what you need. If you're interested in the algorithmic 
> background, then I have a list of publications on my Web site -- 
> http://www.cs.put.poznan.pl/dweiss/xml/publications/index.xml, plus, 
> there is a dozen more I can offer you. If you're trying to play with 

Is there any comparison report on the clustering algorithms used in
Carrot2? Which one does your patch for Nutch use, Lingo?

> Carrot2, then we have a 'manual' on the project's Web site... but as 
> with most open source projects, there may be inaccuracies and black 
> holes because we usually prefer to work on the code than on the docs :)
> The manual is at: 
> http://www.cs.put.poznan.pl/dweiss/carrot/xml/developers/index.xml?lang=en
> 
> If you'd like to tune the clustering component.. well, don't. Not yet. 
> Let's work on the extension api first, then, once it stabilizes, I will 
> probably rebuild the extension to include a newer set of components from 
> Carrot2 and provide instructions on how to tune it.
> 
> Oh, maybe I wasn't clear on this one: I think my contribution to Nutch 
> will be mostly in terms of providing an extension API for clustering 
> plugins and a sample implementation of this extension using Carrot2 
> components... But I also think it would not make much sense to import 
> all the sources of those components into Nutch CVS. While you obviously 
> could do it (and I mean: I have no objections about that), I believe 
> that compiled JARs of Carrot2 modules will be easier to maintain, use 
> and probably play with if you want to experiment with other Carrot2 
> clustering components (the one I provided was Lingo, we have a couple 
> different ones too). The selected components are already quite heavy in 
> terms of size... Doug suggested that a separate repository should be 
> created for 'heavy' plugins like this and I agree with him.

Yes, it would be better to keep it separate.
Doug: how should we proceed on this?

If nobody objects, I am willing to work with Dawid on integrating his stuff.
I suggest we do in-JSP clustering first and worry about distributed clustering
later; I will give the latter a try myself.
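
As a starting point for discussion, here is a rough sketch of what the JSP
could call. All names in it (ClusteringSketch, ResultClusterer, Cluster,
HostClusterer, clusterIfEnabled, and the "on" parameter value) are
hypothetical and are not taken from Dawid's patch. The host-based grouping is
only a stand-in; a real plugin would delegate to Carrot2 (Lingo) instead.
Clustering stays off unless the request explicitly turns it on, as suggested
earlier in this message.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a clustering extension point; not the patch's API. */
public class ClusteringSketch {

    /** One labelled group of hits; hitIndexes point into the original hit list. */
    static final class Cluster {
        final String label;
        final List<Integer> hitIndexes = new ArrayList<>();

        Cluster(String label) {
            this.label = label;
        }
    }

    /** The extension point a clustering plugin would implement (assumed shape). */
    interface ResultClusterer {
        List<Cluster> cluster(List<String> hitUrls);
    }

    /** Stand-in plugin: groups hits by URL host, in first-seen order. */
    static final class HostClusterer implements ResultClusterer {
        @Override
        public List<Cluster> cluster(List<String> hitUrls) {
            Map<String, Cluster> byHost = new LinkedHashMap<>();
            for (int i = 0; i < hitUrls.size(); i++) {
                String host = hostOf(hitUrls.get(i));
                byHost.computeIfAbsent(host, Cluster::new).hitIndexes.add(i);
            }
            return new ArrayList<>(byHost.values());
        }

        private static String hostOf(String url) {
            try {
                String host = URI.create(url).getHost();
                return host != null ? host : "(unknown)";
            } catch (IllegalArgumentException e) {
                return "(unknown)";  // malformed URL
            }
        }
    }

    /** Entry point for the page: returns no clusters unless the parameter is "on". */
    static List<Cluster> clusterIfEnabled(String clusteringParam,
                                          ResultClusterer clusterer,
                                          List<String> hitUrls) {
        if (!"on".equals(clusteringParam)) {
            return Collections.emptyList();
        }
        return clusterer.cluster(hitUrls);
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList(
                "http://a.example/x", "ftp://b.example/y", "http://a.example/z");
        for (Cluster c : clusterIfEnabled("on", new HostClusterer(), hits)) {
            System.out.println(c.label + " -> " + c.hitIndexes);
        }
    }
}
```

Keeping the clusterer behind an interface like this would also let the same
page call a remote clusterer later without changing cluster.jsp.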

Dawid: the use of tabs in your patch is not consistent. It would be great
if you could fix this along with the backslash issue. More importantly, though,
a better (more complete) cluster.jsp is needed.

John

