Dawid Weiss wrote:Hello everybody,Well i'm answering a month later and this is quite unusual to me, but i was in a place with bad connection. Sorry Dawid. So, for those that are not in the subject:This is a nice thing to have. I will add a list of product to compare: Commercial 1) Vivisimo: http://vivisimo.com/ 2) IBoogie: http://iboogie.tv/ 3) Mooter: http://www.mooter.com/ Accademic: 1) Carrot2: http://sourceforge.net/projects/carrot2/ 2) SnakeT: http://roquefort.di.unipi.it/ (and the new experimental beta, http://roquefort.di.unipi.it:8091/ ) 3) Highlight: http://highlight.njit.edu/ 4) CiiRHarchies http://www-ciir.cs.umass.edu/~lawrie/hierarchies/ There are two recent papers on Web Search Result Clustering. One of Microsoft (SIGIR) and the other of IBM (WWW conf). But the authors neither provide the sw nor a web interface (personal communication). I asked for having the clustering results on a fixed set of queries with no success. This happened two months ago. Other academic initiative like Grouper, Ephemeral Clustering, Shoc and many others no longer available (personal communication). Comparing different Web Search Results clustering is a mess since you have to rely on users survey. 2) The price: clustering is usually resource-consuming, so for high-load services (dozens of queries per second) it is probably not an option (at least with the implementation I am going to provide in a minute). Also, clustering usually needs to be performed on a larger "window" of results than the user actually requested... 10 results is not much to cluster. I've set the 'default' to a hundred snippets, you can adjust it to your needs.I should say that this is not that much a problem. In our experiment SnakeT clusters +200 snippets taken by ~16 different in ,~2-3 second. If you have access to the index you should take let say n items from the N satisfying a query, and to select from them the top-k with an heap k < n < N. So many engines fixes k=10, k << n << N having a suboptimal results. But they need to fetch n results from the inverted index in order to rank the top k. Google v.01 suggested this http://www-db.stanford.edu/~backrub/google.html (section 4.5) Is this scheme used in Nutch/Lucence? Nevertheless, the clustering process is an overhead. 3) The extension API: I've prepared a proposal for an extension API to Nutch that would support 'online clusterers' of the type I mentioned. I've put the clustering API in net.nutch.clustering -- two extension interfaces are in the attachment as patch files, but basically this is the core:Do you think that it is possible to plug an interface for other languages? SnakeT is written in Perl and c. a) you're working with an older implementation of the clustering algorithm; the newer one should be faster (don't know whether it is going to be more accurate, but we hope so)David, do you have a reference about these algorithms? b) We don't use external ontologies or knowledge like Vivisimo or SkaneT. I think we should have a way of incorporating them somehow in the future.I'm not aware of the fact that Vivisimo is using some ontology. They claim the opposite: http://vivisimo.com/products/Overview.html Where did you find this info? SnakeT doesn't use an ontology with the traditional meaning. Instead, we do have a ranked dictionary of approximated pairs of terms. The ranked dictionary is built off line and used on line at clustering time. The rank function is a variant of the tf.idf measure, adapted to exploit the categories. In any case we do not relay on the fixed organization of DMOZ. This is somewhat similar to the approach of http://www.cs.berkeley.edu/~milch/papers/www2003.html Any comparison report on clustering algorithms used in carrot2? What is used in your patch for nutch, lingo? The current implementation is Lingo in its original form (based on SVD), but using the new "local interfaces" component binding architecture. Lingo-nmf-km-3 is an abbreviation used for one of the other versions of Lingo, which utilizes Non-negative matrix decomposition. I did not include that component yet because it is still in beta stage.Anywhere those algorithms are briefly listed? Do you have a reference about lingo against STC? Does it used SVD? and what are the difference with SHOC? Does it run fast? My personal experience with SVD decomposition is a pain. It is a nice mathematical tool but i'm not aware of any fast running implementation. -- "We have no credible evidence that Iraq and Al Qaeda cooperated on attacks against the United States." Staff report of the commission investigating the Sept. 11 attacks. |
- Re: [Nutch-dev] Search Results Clustering extension proposal... Antonio Gulli
