Re: carrot2 question too - Re: Fun with the Wikipedia

David Spencer Mon, 17 Jan 2005 11:04:10 -0800

Dawid Weiss wrote:

Hi David,
I apologize for the delay in answering this one; Lucene is a busy mailing list and I had a hectic week last week. Again, sorry for the belated answer; I hope you still find it useful.

Oh no problem, and yes carrot2 is useful and fun. It's a rich package so it takes a while to understand all that it can do.

That is awesome and very inspirational!
Yes, I admit what you've done with Wikipedia is quite interesting and looks very good. I'm also glad you spent some time working out Carrot integration with Lucene. It works quite nicely.

Thanks but I just took code that I think you wrote(!) and made minor mods to it - here's one link: http://www.newsarch.com/archive/mailinglist/jakarta/lucene/user/msg03928.html

I'd like to do more with Carrot2, and that's where things get harder.

Carrot2 looks very interesting. Wondering if anybody has a list of all the
Technically I don't think Carrot2 uses Lucene per se; it's just that you can integrate the two. Ditto for Nutch, which has code that uses Carrot2.
Yes, this is true. Carrot2 doesn't use all of Lucene's potential -- it merely takes the output from a query (titles, URLs and snippets) and attempts to cluster them into some sensible groups. I think many things could be improved. The most important is fast snippet retrieval from Lucene, because right now it takes 50% of the clustering time. I saw a post a while ago describing a faster snippet generation technique, and I'm sure that would give clustering a huge boost speed-wise.
And here's my question. I reread the Carrot2<->Lucene code, esp Demo.java, and there's this fragment:
    // warm-up round (stemmer tables must be read etc).
    List clusters = clusterer.clusterHits(docs);
    long clusteringStartTime = System.currentTimeMillis();
    clusters = clusterer.clusterHits(docs);
    long clusteringEndTime = System.currentTimeMillis();
Thus it calls clusterHits() twice.
I don't really understand how to use Carrot2 yet, but I think the above is just for the sake of benchmarking clusterHits() without the effect of one-time initialization, and that there's no benefit to calling clusterHits() repeatedly (where a benefit might be that it can find nested clusters or whatever). Is that right, that there's no benefit?
No, there is absolutely no benefit from it. It was merely to show people that the clustering needs to be warmed up a bit. I should not have put it in the code knowing people would be confused by it. You can safely use clusterHits just once. It will just have a small delay at the first invocation.
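
For anyone reading this later, here is a minimal sketch of the warm-up-then-time pattern discussed above. It does not use the real Carrot2 API: FakeClusterer, its lazily loaded stemmerTable, and initCount are hypothetical stand-ins, assumed only to show that one-time setup is paid on the first call and that a second call returns the same kind of result without redoing it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WarmUpDemo {

    // Hypothetical stand-in for a clusterer; NOT the Carrot2 API.
    static class FakeClusterer {
        private List<String> stemmerTable; // built lazily on the first call
        int initCount = 0;                 // how many times setup actually ran

        List<List<String>> clusterHits(List<String> docs) {
            if (stemmerTable == null) {
                // Simulated one-time initialization (e.g. reading tables).
                stemmerTable = new ArrayList<>();
                for (int i = 0; i < 200_000; i++) {
                    stemmerTable.add("stem" + i);
                }
                initCount++;
            }
            // Trivial "clustering": a single group holding every document.
            List<List<String>> clusters = new ArrayList<>();
            clusters.add(new ArrayList<>(docs));
            return clusters;
        }
    }

    public static void main(String[] args) {
        FakeClusterer clusterer = new FakeClusterer();
        List<String> docs = Arrays.asList("doc a", "doc b", "doc c");

        // First call: includes the one-time initialization cost.
        long t0 = System.nanoTime();
        clusterer.clusterHits(docs);
        long firstCallNanos = System.nanoTime() - t0;

        // Second call: only the clustering work itself is measured.
        long t1 = System.nanoTime();
        List<List<String>> clusters = clusterer.clusterHits(docs);
        long secondCallNanos = System.nanoTime() - t1;

        System.out.println("first call (with init): " + firstCallNanos + " ns");
        System.out.println("second call:            " + secondCallNanos + " ns");
        System.out.println("clusters: " + clusters.size()
                + ", init ran " + clusterer.initCount + " time(s)");
    }
}
```

In an application that is not benchmarking, a single clusterHits() call is enough; the extra call only matters when you want timings that exclude the first-call setup.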

Thanks for experimenting. Please BCC me if you have any urgent projects -- I read Lucene's list in batches, whereas I try to keep up to date with my personal e-mail.

Dawid


thx,
 Dave


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
