[Nutch-dev] A clarification about Carrot

Dawid Weiss Tue, 17 Feb 2004 02:02:24 -0800

I am a little late with this, but I thought it might be useful for
those interested in on-line clustering of Nutch results.


So, I'm referring to this post first:
> From: Pierre <[EMAIL PROTECTED]>
> 2004-01-26 04:18

> I need 1 person with full knowledge of what is going on with nutch and
> http://www.cs.put.poznan.pl/dweiss/carrot/ project,

I would be very interested in helping to integrate Carrot with Nutch.
I've already written an adapter to Nutch. There were some problems with
extracting snippets in text-only form, but I guess this is trivial to
fix locally until a suitable patch is committed to Nutch.

> we will be testing with 10 million urls and then expand on that, which
> means clustering and db distribution

Well... the clustering components in Carrot don't really scale to millions
of documents. They are on-line, on-the-fly clustering methods, so they work
on the result set returned by Nutch, not on the entire collection of
documents. Moreover, Carrot's algorithms usually operate only on the
snippets returned by a search engine, not on entire document bodies.
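
To make the distinction concrete, here is a rough sketch of what "on-line"
clustering of a result set means. It is a toy, not Carrot's code or any of
its algorithms: the function names, the stopword list and the sample
snippets are all made up for illustration. It simply groups the snippets
of one result page by shared terms and never touches the index itself.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stopword list for the example.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "for", "on", "is"}


def tokenize(snippet):
    """Lowercase a snippet and return its set of non-stopword terms."""
    words = re.findall(r"[a-z0-9]+", snippet.lower())
    return {w for w in words if w not in STOPWORDS}


def cluster_snippets(snippets, min_size=2):
    """Toy grouping: map each shared term to the snippets containing it.

    Only the snippets passed in (one result page) are examined, which is
    why the cost depends on the page size and not on the collection size.
    """
    groups = defaultdict(list)
    for index, snippet in enumerate(snippets):
        for term in tokenize(snippet):
            groups[term].append(index)
    # Keep only terms shared by at least min_size snippets.
    return {term: ids for term, ids in groups.items() if len(ids) >= min_size}


# Hypothetical result set, standing in for what a search engine returns.
results = [
    "Campus map of the university",
    "University library opening hours",
    "Campus parking permits",
]
clusters = cluster_snippets(results)
```

With a few hundred snippets per query this kind of processing is cheap,
but it says nothing about clustering 10 million indexed documents, which
is a different problem.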

You can see how Carrot's components fit together with the Egothor search
engine at http://www.egothor.dundee.ac.uk/. For example:

http://www.egothor.dundee.ac.uk/egothor/search.jsp?q=campus&s_ctx=dundee&v_ctx=clustered&l=pl&results=10

Also, there is an online demo where you can see how Carrot clusters Google
or AllTheWeb data, up to a limit, since we don't want to flood them with
requests.

http://carrot.cs.put.poznan.pl

Pierre, please contact me directly if you need anything.

Now, to another post:

> Antonio Gulli <[EMAIL PROTECTED]>
> Re: Clustering
> 2004-01-27 08:06

> have realized a prototype of Web Clustering engine which acts on the top
> of a meta-search engine (currently wrapping Yahoo, Altavista, Google and
> Lycos.. Teoma and others are in the stack.).

So does Carrot - nothing new. We even have an adaptive wrapper learner :)
But a direct snippet feed is always more efficient, so tight integration
with Nutch will be better than wrapping its results.

> The code is in perl (mod_perl+apache), but is not (yet!) a production
> ready one.

Can I see it somewhere? It sounds interesting!

> Carrot is another Grouper (see Etzioni [1] and Zamir, now at Google)
> incarnation. It uses suffix tree (STC) for extracting variable length
> sentences from text, in linear time, and produces flat hierarchy. Carrot
> is planning to integrate a SVD based clustering algorithm.

THIS IS NOT TRUE, and I am surprised you say it after claiming to be
interested in search results clustering. Carrot was indeed an
implementation of STC, but that was about 3 years ago. The algorithm we use
now has nothing to do with STC (although STC is still available, just like
the various flavors of AHC we have added to the framework for comparison).

Dawid

-- 
Dawid Weiss, http://www.cs.put.poznan.pl/dweiss
Laboratory of Intelligent Decision Support Systems, Poznan UT, Poland


