Hi John,

It's great. I really like it. I tried it on two repositories
(one mainly HTTP URLs and the other FTP URLs, each with a million URLs),
and the benefits are immediate. I think this stuff should be in Nutch.

Glad you liked it. Be aware that:

a) you're working with an older implementation of the clustering algorithm; the newer one should be faster (don't know whether it is going to be more accurate, but we hope so)

b) We don't use external ontologies or knowledge like Vivisimo or SkaneT. I think we should have a way of incorporating them somehow in the future.

c) The user interface I proposed was a 30-minute draft just to show the capabilities of the plugin. I think the user interface should be thought over again and rewritten.

Anyway, I also think this could be a nice add-on to Nutch even in the form it is available today.

As for technical difficulties, I thought about one thing today: Nutch supports multiple (distributed) searchers, right? The extension I proposed is used from the JSP page directly, which is probably not good (the Web container serving JSPs will do all the clustering work). It would make more sense to distribute the clusterer together with searchers, or to distribute the clusterer even to a different set of machines... I don't have enough knowledge of Nutch internals to actually design this, so if you could take a look at what I proposed and maybe try to fit it in with the distributed searchers then it would be awesome.

When applying the patch, there was a little problem for me,
as I use Linux as my development platform: the patch has file paths
using `\` as the separator and it does not apply right away. But this is
a non-issue, I solved it by replacing them with forward slashes
(is there a better way?).

Yeah, I could ask the same question -- is there a way to generate a patch using the Windows distribution of UnxUtils and get those slashes right... I apologize for that. Next time I'll just move the codebase to my Linux machine and create the patch there.
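
In the meantime, the slash replacement described above can be scripted. This is only a minimal sketch, not part of the patch: the class name `PatchPathFixer` and the list of header prefixes are my assumptions about what a CVS-style unified diff contains. It rewrites backslashes only on lines that look like file headers, so backslashes inside the changed code itself are left alone. One known weakness: a removed source line that itself begins with `-- ` shows up in the diff as `--- ...` and would be rewritten too.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class PatchPathFixer {
    // Assumed set of diff header prefixes that carry file paths.
    private static final String[] HEADER_PREFIXES = {"--- ", "+++ ", "Index: ", "diff "};

    /** Replaces '\' with '/' on header lines; returns every other line unchanged. */
    public static String fixLine(String line) {
        for (String prefix : HEADER_PREFIXES) {
            if (line.startsWith(prefix)) {
                return line.replace('\\', '/');
            }
        }
        return line;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: PatchPathFixer <in.patch> <out.patch>");
            return;
        }
        // ISO-8859-1 maps every byte to a character, so any input can be read and written back.
        List<String> fixed = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get(args[0]), StandardCharsets.ISO_8859_1)) {
            fixed.add(fixLine(line));
        }
        // Lines are written back with the platform line separator.
        Files.write(Paths.get(args[1]), fixed, StandardCharsets.ISO_8859_1);
    }
}
```

Run it on the downloaded patch file and apply the output with `patch` as usual.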


Any document or note for us to jump start?

Tell me what you need. If you're interested in the algorithmic background, then I have a list of publications on my Web site -- http://www.cs.put.poznan.pl/dweiss/xml/publications/index.xml, plus, there are a dozen more I can offer you. If you're trying to play with Carrot2, then we have a 'manual' on the project's Web site... but as with most open source projects, there may be inaccuracies and black holes because we usually prefer to work on the code rather than on the docs :)
The manual is at: http://www.cs.put.poznan.pl/dweiss/carrot/xml/developers/index.xml?lang=en


If you'd like to tune the clustering component... well, don't. Not yet. Let's work on the extension API first, then, once it stabilizes, I will probably rebuild the extension to include a newer set of components from Carrot2 and provide instructions on how to tune it.

Oh, maybe I wasn't clear on this one: I think my contribution to Nutch will be mostly in terms of providing an extension API for clustering plugins and a sample implementation of this extension using Carrot2 components... But I also think it would not make much sense to import all the sources of those components into Nutch CVS. While you obviously could do it (and I mean: I have no objections to that), I believe that compiled JARs of Carrot2 modules will be easier to maintain and use, and probably easier to play with if you want to experiment with other Carrot2 clustering components (the one I provided was Lingo, we have a couple of different ones too). The selected components are already quite heavy in terms of size... Doug suggested that a separate repository should be created for 'heavy' plugins like this and I agree with him.
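
To make the split between "extension API" and "sample implementation" a bit more concrete, here is a rough sketch of the shape I have in mind. Every name in it (`ClusteringSketch`, `Hit`, `Cluster`, `HitsClusterer`, `HostClusterer`) is hypothetical and made up for this mail; it is not the API from the patch and not a Carrot2 interface. The toy plugin simply groups hits by URL host, standing in for a real algorithm such as Lingo.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ClusteringSketch {
    /** One search hit as a clusterer sees it (hypothetical type). */
    public static final class Hit {
        public final String url;
        public final String summary;
        public Hit(String url, String summary) { this.url = url; this.summary = summary; }
    }

    /** A labelled group of hits (hypothetical type). */
    public static final class Cluster {
        public final String label;
        public final List<Hit> hits = new ArrayList<>();
        public Cluster(String label) { this.label = label; }
    }

    /** Hypothetical extension point: a plugin turns a hit list into clusters. */
    public interface HitsClusterer {
        List<Cluster> cluster(List<Hit> hits);
    }

    /** Toy plugin: groups hits by URL host, keeping first-seen order. */
    public static final class HostClusterer implements HitsClusterer {
        @Override
        public List<Cluster> cluster(List<Hit> hits) {
            Map<String, Cluster> byHost = new LinkedHashMap<>();
            for (Hit hit : hits) {
                // URI.create throws IllegalArgumentException on malformed URLs.
                String host = URI.create(hit.url).getHost();
                String label = (host == null) ? "(unknown host)" : host;
                byHost.computeIfAbsent(label, Cluster::new).hits.add(hit);
            }
            return new ArrayList<>(byHost.values());
        }
    }

    public static void main(String[] args) {
        HitsClusterer clusterer = new HostClusterer();
        List<Cluster> clusters = clusterer.cluster(Arrays.asList(
                new Hit("http://a.example/1", "first"),
                new Hit("ftp://b.example/2", "second"),
                new Hit("http://a.example/3", "third")));
        for (Cluster c : clusters) {
            System.out.println(c.label + ": " + c.hits.size() + " hit(s)");
        }
    }
}
```

The point of keeping the interface this narrow is that the caller only depends on `HitsClusterer`, so the heavy implementation can live in separate JARs, and the same call could later be moved out of the JSP and closer to the searchers.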

Dawid



