Dawid Weiss wrote:
Hello everybody,

I think it would be worthwhile to somehow standardize, or rather coordinate, the effort of "search results clustering" people like Antonio Gulli or myself. I would like to propose an Extension that fits nicely into Nutch's plugin system. It is merely two or three interface classes, so it is not much, but it is a start.
Well, I'm answering a month later, which is quite unusual for me, but I was in a place with a bad connection.
Sorry, Dawid.
So, for those who are not familiar with the subject:

1) The objective: dynamically (for each query) group search results into a semantically related structure of clusters (ideal case), possibly hierarchical. This should speed up the user's comprehension of what kinds of topics are present in the result. More and more search engines support clustering. The best product out there is Vivisimo.com's clustering engine, but others are trailing. My project Carrot2 (open source, hosted at SourceForge) aims at providing a nice and easy-to-use framework that can help people experiment and develop new clustering algorithms. So far we have a couple of algorithms, some of which give decent results.
This is a nice thing to have. I will add a list of products to compare:

Commercial
1) Vivisimo:  http://vivisimo.com/
2) IBoogie: http://iboogie.tv/
3) Mooter: http://www.mooter.com/

Academic:
1) Carrot2:  http://sourceforge.net/projects/carrot2/
2) SnakeT:  http://roquefort.di.unipi.it/  (and the new experimental beta, http://roquefort.di.unipi.it:8091/ )
3)  Highlight: http://highlight.njit.edu/
4) CiiRHarchies  http://www-ciir.cs.umass.edu/~lawrie/hierarchies/

There are two recent papers on Web search result clustering, one from Microsoft (SIGIR) and the other from IBM (WWW conference), but the authors provide neither the software nor a web interface (personal communication). I asked for the clustering results on a fixed set of queries, with no success. This happened two months ago. Other academic initiatives like Grouper, Ephemeral Clustering, SHOC and many others are no longer available (personal communication).
Comparing different Web search result clustering systems is a mess, since you have to rely on user surveys.

2) The price: clustering is usually resource-consuming, so for high-load services (dozens of queries per second) it is probably not an option (at least with the implementation I am going to provide in a minute). Also, clustering usually needs to be performed on a larger "window" of results than the user actually requested... 10 results is not much to cluster. I've set the 'default' to a hundred snippets, you can adjust it to your needs.
I should say that this is not that much of a problem. In our experiments SnakeT clusters 200+ snippets, taken from ~16 different engines, in ~2-3 seconds. If you have access to the index, you should take, let's say, n items from the N satisfying a query, and select from them the top k with a heap (k < n < N). Many engines fix k=10, with k << n << N, obtaining suboptimal results. But they still need to fetch n results from the inverted index in order to rank the top k. Google v0.1 suggested this:
http://www-db.stanford.edu/~backrub/google.html (section 4.5)
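The heap scheme described above (fetch n candidate results, keep only the top k) can be sketched in Java as follows; the score array is a made-up stand-in for whatever ranking values the engine produces:

```java
import java.util.PriorityQueue;

public class TopK {
    // Keep the k highest-scored of n candidate results using a min-heap.
    // O(n log k) time and O(k) memory: we never sort all n candidates.
    public static double[] topK(double[] scores, int k) {
        PriorityQueue<Double> heap = new PriorityQueue<>(); // min at the top
        for (double s : scores) {
            if (heap.size() < k) {
                heap.add(s);
            } else if (s > heap.peek()) {
                heap.poll(); // drop the current k-th best
                heap.add(s);
            }
        }
        double[] best = new double[heap.size()];
        for (int i = best.length - 1; i >= 0; i--) {
            best[i] = heap.poll(); // heap yields ascending -> fill from the back
        }
        return best; // highest score first
    }

    public static void main(String[] args) {
        double[] scores = {0.1, 0.9, 0.4, 0.7, 0.2, 0.8};
        System.out.println(java.util.Arrays.toString(topK(scores, 3)));
        // prints [0.9, 0.8, 0.7]
    }
}
```

With a min-heap of size k this costs O(n log k) instead of the O(n log n) of sorting all n fetched results, which is why k << n is cheap.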

Is this scheme used in Nutch/Lucene?

Nevertheless, the clustering process is an overhead.
3) The extension API: I've prepared a proposal for an extension API to Nutch that would support 'online clusterers' of the type I mentioned. I've put the clustering API in net.nutch.clustering -- two extension interfaces are in the attachment as patch files, but basically this is the core:

public HitsCluster [] clusterHits(HitDetails [] hitDetails, String [] descriptions);

Plain and simple. Maybe too simple; we will see. Anyway, I tried to design the extension interface so that we can directly reuse data already available during the regular search process (snippets, hit details).
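For illustration, here is a trivial, self-contained sketch of such an 'online clusterer'. The HitsCluster stand-in, the bucketing logic, and all names are my own assumptions made up for the sketch; the real interfaces are in the attached patch, and I dropped the HitDetails argument so the example compiles on its own:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal stand-in for the Nutch type; the real HitsCluster interface
// from the patch will certainly differ in detail.
class HitsCluster {
    final String label;
    final List<Integer> hitIndexes; // indexes into the original hit arrays
    HitsCluster(String label, List<Integer> hitIndexes) {
        this.label = label;
        this.hitIndexes = hitIndexes;
    }
}

public class NaiveClusterer {
    // Toy 'online clusterer': bucket hits by the first word of each snippet.
    // A real implementation (STC, Lingo, ...) would go here instead.
    public static HitsCluster[] clusterHits(String[] descriptions) {
        Map<String, List<Integer>> buckets = new LinkedHashMap<>();
        for (int i = 0; i < descriptions.length; i++) {
            String first = descriptions[i].trim().split("\\s+")[0].toLowerCase();
            buckets.computeIfAbsent(first, k -> new ArrayList<>()).add(i);
        }
        List<HitsCluster> clusters = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : buckets.entrySet()) {
            clusters.add(new HitsCluster(e.getKey(), e.getValue()));
        }
        return clusters.toArray(new HitsCluster[0]);
    }

    public static void main(String[] args) {
        String[] snippets = {"Java clustering toolkit", "java web search", "Perl meta search"};
        HitsCluster[] cs = clusterHits(snippets);
        System.out.println(cs.length + " clusters, first: " + cs[0].label);
        // prints: 2 clusters, first: java
    }
}
```

The point of the one-method interface is exactly this: a clusterer only consumes data the search already produced, so plugging in a different algorithm means swapping one class.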

Do you think that it is possible to plug an interface for other languages?
SnakeT is written in Perl and C.

a) you're working with an older implementation of the clustering algorithm; the newer one should be faster (I don't know whether it will be more accurate, but we hope so)
Dawid, do you have a reference about these algorithms?

b) We don't use external ontologies or knowledge like Vivisimo or SnakeT. I think we should have a way of incorporating them somehow in the future.
I'm not aware that Vivisimo uses an ontology. They claim the opposite:
http://vivisimo.com/products/Overview.html
Where did you find this info?

SnakeT doesn't use an ontology in the traditional sense. Instead, we have a ranked dictionary of approximated pairs of terms. The ranked dictionary is built offline and used online at clustering time. The ranking function is a variant of the tf.idf measure, adapted to exploit the categories. In any case, we do not rely on the fixed organization of DMOZ. This is somewhat similar to the approach of http://www.cs.berkeley.edu/~milch/papers/www2003.html
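For reference, the plain tf.idf weight that such a ranking function adapts can be computed as below; the formula variant (natural log, no smoothing) and the numbers are my choice for the sketch, not SnakeT's actual function:

```java
public class TfIdf {
    // Classic tf.idf: tf * ln(N / df), where N is the number of documents in
    // the collection and df the number of documents containing the term.
    // Category-aware variants (as described above) reweight this base score.
    public static double tfIdf(int termFreq, int docFreq, int numDocs) {
        return termFreq * Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        // A term occurring 3 times in a doc, present in 10 of 1000 docs:
        System.out.printf("%.4f%n", tfIdf(3, 10, 1000)); // prints 13.8155
    }
}
```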
Is there any comparison report on the clustering algorithms used in
Carrot2? Which one is used in your patch for Nutch, Lingo?
> The current implementation is Lingo in its original form (based on SVD),
> but using the new "local interfaces" component binding architecture.
> Lingo-nmf-km-3 is an abbreviation used for one of the other versions of
> Lingo, which utilizes non-negative matrix factorization. I did not
> include that component yet because it is still in beta stage.

Are those algorithms briefly listed anywhere?

Do you have a reference comparing Lingo against STC?
Does it use SVD? And what are the differences from SHOC?
Does it run fast? My personal experience with SVD decomposition has been painful.
It is a nice mathematical tool, but I'm not aware of any fast implementation.
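On the SVD cost point: when only the leading singular vectors are needed, power iteration on AᵀA avoids computing a full decomposition. This is a generic numerical sketch, not Carrot2's actual code:

```java
import java.util.Arrays;

public class PowerIteration {
    // Estimate the largest singular value of A via power iteration on A^T A,
    // using sigma_max(A) = sqrt(lambda_max(A^T A)). A full SVD is far more
    // expensive; a few matrix-vector products suffice for the dominant value.
    public static double largestSingularValue(double[][] a, int iterations) {
        int n = a[0].length;
        double[] v = new double[n];
        Arrays.fill(v, 1.0 / Math.sqrt(n)); // arbitrary unit start vector
        for (int it = 0; it < iterations; it++) {
            double[] av = multiply(a, v);              // A v
            double[] atav = multiplyTransposed(a, av); // A^T (A v)
            double norm = 0;
            for (double x : atav) norm += x * x;
            norm = Math.sqrt(norm);
            for (int i = 0; i < n; i++) v[i] = atav[i] / norm;
        }
        double[] av = multiply(a, v);
        double s = 0;
        for (double x : av) s += x * x;
        return Math.sqrt(s); // ||A v|| ~= sigma_max once v has converged
    }

    private static double[] multiply(double[][] a, double[] v) {
        double[] r = new double[a.length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < v.length; j++)
                r[i] += a[i][j] * v[j];
        return r;
    }

    private static double[] multiplyTransposed(double[][] a, double[] v) {
        double[] r = new double[a[0].length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < r.length; j++)
                r[j] += a[i][j] * v[i];
        return r;
    }

    public static void main(String[] args) {
        double[][] a = {{3, 0}, {0, 2}}; // singular values 3 and 2
        System.out.printf("%.3f%n", largestSingularValue(a, 50)); // prints 3.000
    }
}
```

Truncated approaches like this (or Lanczos-based solvers) are why SVD-backed clustering can still run at query time: the term-document matrix for a hundred snippets is small, and only a handful of singular vectors are needed.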

-- 
"We have no credible evidence that Iraq and Al Qaeda 
cooperated on attacks against the United States."
Staff report of the commission investigating the Sept. 
11 attacks.
