I looked at the Carrot2 docs which mentioned dimension reduction via singular value decomposition (SVD) .. and other forms too I think.

Question: Does anyone have pointers to successful clustering techniques used with lucene? I'm particularly interested in 2D and 3D graphics as well, possibly SOM (Self Organizing Maps).

I'm hoping to combine lucene with a graphical auto-clustering stunt of some kind but am not sure how to do it yet.

Owen


From: Akmal Sarhan <[EMAIL PROTECTED]>
Date: January 28, 2005 8:19:03 AM MST
To: Lucene Users List <lucene-user@jakarta.apache.org>
Subject: Re: carrot2 question too - Re: Fun with the Wikipedia


Hello,

we have been experimenting with carrot2 and are very pleased so far,
only one issue: there is no release not even an alpha one and the
dependencies seemed to be patched (jama)
is there any intentions to have any releases in the near future?

thanks

Akmal
Am Montag, den 17.01.2005, 10:15 +0100 schrieb Dawid Weiss:
Hi David,

I apologize about the delay in answering this one, Lucene is a busy
mailing list and I had a hectic last week... Again, sorry for belated
answer, hope you still find it useful.

That is awesome and very inspirational!

Yes, I admit what you've done with Wikipedia is quite interesting and looks very good. I'm also glad you spent some time working out Carrot integration with Lucene. It works quite nice.

Carrot2 looks very interesting. Wondering if anybody has a list of all
the

Technically I don't think carrot2 uses lucene per-se- it's just that you
can integrate the two, and ditto for Nutch - it has code that uses Carrot2.

Yes, this is true. Carrot2 doesn't use all of Lucene's potential -- it
merely takes the output from a query (titles, urls and snippets) and
attempts to cluster them into some sensible groups. I think many things
could be improved, the most important of them is fast snippet retrieval
from Lucene because right now it takes 50% of the time of the
clustering; I've seen a post a while ago describing a faster snippet
generation technique, I'm sure that would give clustering a huge boost
speed-wise.


And here's my question. I reread the Carrot2<->Lucene code, esp
Demo.java, and there's this fragment:

    // warm-up round (stemmer tables must be read etc).
    List clusters = clusterer.clusterHits(docs);

    long clusteringStartTime = System.currentTimeMillis();
    clusters = clusterer.clusterHits(docs);
    long clusteringEndTime = System.currentTimeMillis();

Thus it calls clusterHits() twice.

I don't really understand how to use Carrot2 - but I think the above is
just for the sake of benchmarking clusterHits() w/o the effect of 1-time
initialization - and that there's no benefit of repeatedly calling
clusterHits (where a benefit might be that it can find nested clusters
or whatever) - is that right (that there's no benefit)?

No, there is absolutely no benefit from it. It was merely to show people
that the clustering needs to be warmed up a bit. I should not have put
it in the code knowing people would be confused by it. You can safely
use clusterHits just once. It will just have a small delay at the first
invocation.



Thanks for experimenting. Please BCC me if you have any urgent projects
-- I read Lucene's list in batches and my personal e-mail I try to keep up to date with.


Dawid




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Reply via email to