Hi Antonio,
Well, I'm answering a month later, which is quite unusual for me, but I was in a place with a bad connection.
Sorry Dawid.
Hey, no problem.
This is a nice thing to have. I will add a list of products to compare:
Commercial:
1) Vivisimo: http://vivisimo.com/
2) IBoogie: http://iboogie.tv/
3) Mooter: http://www.mooter.com/
Academic:
1) Carrot2: http://sourceforge.net/projects/carrot2/
2) SnakeT: http://roquefort.di.unipi.it/ (and the new experimental beta, http://roquefort.di.unipi.it:8091/ )
3) Highlight: http://highlight.njit.edu/
4) CIIR Hierarchies: http://www-ciir.cs.umass.edu/~lawrie/hierarchies/
There's also TripleHop (commercial) -- they seem to have quite a good one, see it in action for example at:
http://www.find.com/
I should say that this is not that much of a problem. In our experiments SnakeT clusters 200+ snippets taken from ~16 different sources in ~2-3 seconds.
Well, Lingo is a lot faster -- 200 snippets in less than a second... but that's not the point -- if you compare a fraction of a second with the milliseconds a usual search engine takes to find/organize search results, it is still an order of magnitude slower, right? That was my point -- adding a clustering engine on top of a search engine will be a problem if your search engine servers are already at, let's say, a 50% use ratio. If they're at 2%, it's going to skyrocket to 20%, but you're still in the safe zone.
Do you think it is possible to plug in an interface for other languages? SnakeT is written in Perl and C.
It is rather a matter of how you implement the above interface than the interface itself. I see a couple of solutions for you:
1. We built Carrot2 with extensibility in mind, so you could reuse Carrot2, use it to plug into Nutch and set up SnakeT as a remote clustering component within the Carrot2 clustering process (it is quite trivial: you accept an XML using an HTTP POST query, then return an XML as a result). Of course SnakeT would then suffer because all queries would be routed via the network, and XML serialization/deserialization in Java is quite computationally expensive too. But it is the EASIEST way of plugging SnakeT into Nutch and I'd be happy to help you out if you want to proceed this way.
2. An alternative is to use the Java Native Interface to plug SnakeT directly into the above Java interfaces. You would have to write a Java implementation of that interface and route all calls to the C/Perl code via JNI. This is quite complex to learn at first, but it is possible and works.
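To make option 1 more concrete, here is a rough sketch of what the remote component's core could look like. The XML element names (searchresult, document, title, group) are my assumptions for illustration only, not the actual Carrot2 wire format, and a real component would hand the snippets to SnakeT instead of emitting one trivial cluster:

```java
// Hypothetical sketch of a remote clustering component's request/response
// handling. The XML vocabulary here is invented for illustration; in a
// real setup this method would sit behind an HTTP POST endpoint and
// delegate the grouping to SnakeT.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RemoteClusteringStub {

    // Pull snippet titles out of the request XML and wrap them all in a
    // single placeholder "group" (cluster) in the response XML.
    static String cluster(String requestXml) {
        List<String> titles = new ArrayList<>();
        Matcher m = Pattern.compile("<title>(.*?)</title>").matcher(requestXml);
        while (m.find()) {
            titles.add(m.group(1));
        }
        StringBuilder out = new StringBuilder("<searchresult><group label=\"all\">");
        for (String t : titles) {
            out.append("<document>").append(t).append("</document>");
        }
        out.append("</group></searchresult>");
        return out.toString();
    }

    public static void main(String[] args) {
        String req = "<searchresult>"
                + "<document><title>Apache Nutch</title></document>"
                + "<document><title>Carrot2 clustering</title></document>"
                + "</searchresult>";
        System.out.println(cluster(req));
    }
}
```

A production version would obviously use a real XML parser rather than a regex, but the shape of the exchange -- XML in, XML out -- is the whole contract.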
a) you're working with an older implementation of the clustering algorithm; the newer one should be faster (don't know whether it is going to be more accurate, but we hope so)
David, do you have a reference about these algorithms?
I have a reference to the "older" Lingo algorithm, if you're seeking that -- they are on my publications page:
http://www.cs.put.poznan.pl/dweiss/xml/publications/index.xml?lang=en
1. Stanisław Osiński, Jerzy Stefanowski, Dawid Weiss: Lingo: Search Results Clustering Algorithm Based on Singular Value Decomposition. In: Advances in Soft Computing, Intelligent Information Processing and Web Mining, Proceedings of the International IIS: IIPWM'04 Conference, Zakopane, Poland, 2004, pp. 359–368.
2. Stanisław Osiński, Dawid Weiss: Lingo - Concept-Driven Algorithm for Clustering Search Results. In: IEEE Intelligent Systems, to appear.
[ this one I can send you via private communication if you wish ]
b) We don't use external ontologies or knowledge like Vivisimo or SnakeT do. I think we should have a way of incorporating them somehow in the future.
I'm not aware of Vivisimo using any ontology. They claim the opposite:
http://vivisimo.com/products/Overview.html
Where did you find this info?
Vivisimo does not _need_ to use any ontology, but they have the mechanisms to incorporate arbitrary ontologies into their clustering algorithm to 'direct' it a little bit. I can't recall where I found this -- it might have been from a public press release somewhere or from my private communication with Vivisimo's people. I think they do boost their online results using some ontology though -- they are sometimes too good to be completely based on the search results ;)
SnakeT doesn't use an ontology in the traditional meaning. Instead, we have a ranked dictionary of approximated pairs of terms. The ranked dictionary is built offline and used online at clustering time. The rank function is a variant of the tf.idf measure, adapted to exploit the categories. In any case we do not rely on the fixed organization of DMOZ. This is somewhat similar to the approach of http://www.cs.berkeley.edu/~milch/papers/www2003.html
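The SnakeT-specific, category-adapted variant of the rank function isn't spelled out here, but the baseline tf.idf measure it builds on is easy to sketch (the three-document corpus below is made up for illustration):

```java
// Plain tf.idf as a point of reference -- the SnakeT variant adapted to
// categories is not public, so this only shows the baseline measure the
// ranked dictionary's weights are derived from.
import java.util.Arrays;
import java.util.List;

public class TfIdf {

    // tf.idf of a term for one document against a corpus:
    // (term frequency in the document) * log(N / document frequency).
    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        long tf = doc.stream().filter(term::equals).count();
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        if (df == 0) {
            return 0.0; // term absent from the corpus
        }
        return tf * Math.log((double) corpus.size() / df);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("search", "results", "clustering"),
                Arrays.asList("search", "engine"),
                Arrays.asList("matrix", "decomposition"));
        // "clustering" occurs once in doc 0 and in 1 of 3 documents:
        // 1 * log(3/1) = 1.0986...
        System.out.printf("%.3f%n", tfIdf("clustering", corpus.get(0), corpus)); // prints 1.099
    }
}
```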
I never meant to claim that you rely on a set of fixed categories -- I just said you use an ontology as an external source of information helpful to organize the results dynamically (which is perfectly reasonable). Good for you that you made it! :)
Any comparison report on clustering algorithms used in Carrot2?
Yes, Staszek recently made such a report. I will publish it soon on Carrot2 Web page and let you know.
What is used in your patch for Nutch, Lingo?
Yes, the first version of Lingo as described in the above paper references.
The current implementation is Lingo in its original form (based on SVD), but using the new "local interfaces" component binding architecture. Lingo-nmf-km-3 is an abbreviation used for one of the other versions of Lingo, which utilizes non-negative matrix factorization. I did not include that component yet because it is still in the beta stage.
Anywhere those algorithms are briefly listed?
Nope, not yet. Be patient ;)
Do you have a reference about lingo against STC?
One of the papers (the one from Zakopane) covers this. The net result is: Lingo is much better both at finding good clusters and at finding good labels.
Does it use SVD? And what are the differences from SHOC?
Lingo does use SVD in its first version. We also experimented with other matrix decompositions to get a good feature reduction, with some success. When we have it publishable I'll let you know. The difference from SHOC is that they only do matrix decomposition to get the clusters. We do the opposite -- we do matrix decomposition to get a hint at what good CLUSTER LABELS would be, then we pick those good cluster labels and apply them again to the document set to get cluster content. We call it the cluster-label-comes-first approach.
Does it run fast? In my personal experience, SVD decomposition is a pain.
It depends on how you look at it. STC is faster. SVD performed using Java routines is fast enough to cluster 100-200 snippets, not more. We also integrated the ATLAS libraries (native) and use them transparently if they're available -- the speed increase is about 5-fold, so it was worth it. If you use alternative matrix decompositions, you get an even bigger speedup.
It is a nice mathematical tool but i'm not aware of any fast running implementation.
The theoretical complexity is, I think, a polynomial of third degree plus some constants... so it is quite complex. On the other hand, you don't have to deal with super large matrices in search results clustering, so it seems to work just perfectly in this application (at least it seems so to us). Native implementations of SVD are quite fast -- check out the following:
http://www.netlib.org/lapack/ http://math-atlas.sourceforge.net/
you'll find more references from there.
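As a toy illustration of why the cost is manageable at search-result scale, here is a power-iteration sketch that recovers only the dominant singular value. Real code would call the LAPACK/ATLAS routines linked above; this just shows the shape of the computation on the small matrices snippet clustering deals with:

```java
// Toy stand-in for a real SVD routine: power iteration on A^T A to find
// the largest singular value of A. Not production quality -- LAPACK/ATLAS
// compute the full decomposition far more robustly -- but it shows why
// small snippet-sized matrices are cheap to handle.
import java.util.Arrays;

public class PowerSvd {

    static double largestSingularValue(double[][] a, int iters) {
        int m = a.length, n = a[0].length;
        double[] v = new double[n];
        Arrays.fill(v, 1.0 / Math.sqrt(n)); // arbitrary unit start vector
        double sigma = 0;
        for (int it = 0; it < iters; it++) {
            double[] av = new double[m]; // av = A v
            for (int i = 0; i < m; i++)
                for (int j = 0; j < n; j++)
                    av[i] += a[i][j] * v[j];
            double[] w = new double[n];  // w = A^T av = (A^T A) v
            for (int i = 0; i < m; i++)
                for (int j = 0; j < n; j++)
                    w[j] += a[i][j] * av[i];
            double norm = 0;
            for (double x : w) norm += x * x;
            norm = Math.sqrt(norm);      // tends to sigma^2 at convergence
            for (int j = 0; j < n; j++)
                v[j] = w[j] / norm;      // renormalize for the next step
            sigma = Math.sqrt(norm);
        }
        return sigma;
    }

    public static void main(String[] args) {
        double[][] a = {{3, 0}, {0, 1}}; // singular values 3 and 1
        System.out.printf("%.4f%n", largestSingularValue(a, 100)); // prints 3.0000
    }
}
```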
Cheers, Dawid
