I am a little late with this, but I thought it might be useful to those interested in on-line clustering of Nutch results.

So, I'm referring to this post first:

> From: Pierre <[EMAIL PROTECTED]>
> 2004-01-26 04:18

> I need 1 person with full knowledge of what is going on with nutch and
> http://www.cs.put.poznan.pl/dweiss/carrot/ project,

I would be very interested in providing help to integrate Carrot with Nutch. I've written an adapter to Nutch already -- there were some problems with extracting snippets in text-only form, but I guess this is trivial to fix locally until a suitable patch in Nutch is committed.

> we will be testing with 10 million urls and then expand on that, which
> means clustering and db distrobution

Well... the clustering components in Carrot don't really scale to millions of documents. I mean - they are on-line, on-the-fly clustering methods, so they work on the result set returned by Nutch, not on the entire collection of documents. More -- Carrot's algorithms are usually focused only on snippets returned by a search engine, not entire document bodies.

You can see how Carrot's components fit together with the Egothor search engine at http://www.egothor.dundee.ac.uk/. For example:

http://www.egothor.dundee.ac.uk/egothor/search.jsp?q=campus&s_ctx=dundee&v_ctx=clustered&l=pl&results=10

Also, there is an online demo where you can see how Carrot clusters Google or AllTheWeb data... to some limit, we don't want to flood them with requests.

http://carrot.cs.put.poznan.pl

Pierre, please contact me directly if you need anything.

Now, to another post:

> Antonio Gulli <[EMAIL PROTECTED]>.unipi.it>
> Re: Clustering
> 2004-01-27 08:06

> have realized a prototype of Web Clustering engine which acts on the top
> of a meta-search engine (currently wrapping Yahoo, Altavista, Google and
> Lycos.. Teoma and others are in the stack.).

So does Carrot - nothing new. We even have an adaptive wrapper learner :) But a direct snippet feed is always more efficient, so tight integration with Nutch will be better than wrapping its results.

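
To make the "on-line, snippets only" point concrete, here is a minimal Python sketch of clustering a single result set on the fly. It is only an illustration under stated assumptions: it is not Carrot's algorithm or API, and `tokenize`, `cluster_snippets`, and the stopword list are hypothetical names invented for this example. It simply groups the snippets of one result page under every term that at least two of them share.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "of", "in", "and", "to", "for", "at", "on", "is"}


def tokenize(snippet):
    """Lowercase a snippet and return its set of non-stopword terms."""
    words = re.findall(r"[a-z]+", snippet.lower())
    return {w for w in words if w not in STOPWORDS}


def cluster_snippets(snippets, min_size=2):
    """Group snippet indices under each term shared by >= min_size snippets.

    Works only on the snippets passed in (one result set), never on a
    whole document collection. Clusters may overlap.
    """
    by_term = defaultdict(list)
    for index, snippet in enumerate(snippets):
        for term in tokenize(snippet):
            by_term[term].append(index)
    # Keep only terms that actually bring several snippets together.
    return {term: ids for term, ids in by_term.items() if len(ids) >= min_size}
```

Calling `cluster_snippets` on the snippets of one search response returns a dict mapping a label term to the positions of the snippets that contain it. The cost depends on the size of the result set, not on the size of the index, which is why this style of clustering sits behind the search engine rather than inside the crawl or indexing step.
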
> The code is in perl (mod_perl+apache), but is not (yet!) a production
> ready one.

Can I see it somewhere? It sounds interesting!

> Carrot is another Grouper (see Etzioni [1] and Zamir, now at Google)
> incarnation. It uses suffix tree (STC) for extracting variable length
> sentences from text, in linear time, and produces flat hierarchy. Carrot
> is plaining to integrate a SVD based clustering algorithm.

THIS IS NOT TRUE, and I am surprised you say it after you'd claimed you're interested in search results clustering. Carrot was an implementation of STC, indeed, but that was about 3 years ago. The algorithm we use now has nothing to do with STC (although STC is still available, just like the various flavors of AHC we have added to the framework for comparison).

Dawid

--
Dawid Weiss, http://www.cs.put.poznan.pl/dweiss
Laboratory of Intelligent Decision Support Systems, Poznan UT, Poland
