You might be interested in the Carrot project, which has some Lucene support. I don't know whether it solves your second problem, but it already implements clustering and may get you to an answer more quickly. I recently started using it for a clustering task of my own involving search results. I think the author of Carrot is on the user list from time to time.


Marvin Humphrey wrote:
Greets,

I'm toying with the idea of implementing clustering of search results based on comparison of document vectors constrained by field. For instance, you could cluster based on "topic", or "domain", or "content". "domain" would be easy, as it's presumably a single value field. "content" would be much more involved.

The problem I'm trying to solve is how to return a minimum number of clusters from a search. Say the most relevant 100 documents for a query are all from the same domain, but you want a maximum of two results per domain, a la Google. I don't see any alternative to rerunning the query an indeterminate number of times until you've accumulated sufficient clusters, because the search logic doesn't know what cluster a document belongs to until the document vector is retrieved.
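
For concreteness, the rerun-until-enough approach described above might look roughly like the Python sketch below. Everything in it is hypothetical: `search` stands in for whatever call returns the top-n hits as (doc_id, cluster_key) pairs in rank order, and `collect_clusters`, `min_clusters`, `max_per_cluster`, `initial_window`, `growth` and `max_window` are illustrative names, not part of Lucene or any other library. It only shows the control flow: widen the result window and rerun until enough distinct clusters turn up or the hits run out.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A hit is (doc_id, cluster_key); the key could be the "domain" field value.
Hit = Tuple[str, str]


def collect_clusters(
    search: Callable[[int], Sequence[Hit]],  # hypothetical: top-n hits, best first
    min_clusters: int,
    max_per_cluster: int = 2,
    initial_window: int = 100,
    growth: int = 2,
    max_window: int = 10_000,
) -> Dict[str, List[str]]:
    """Rerun the query with a growing window until enough clusters are seen."""
    window = initial_window
    while True:
        hits = search(window)
        clusters: Dict[str, List[str]] = {}
        for doc_id, key in hits:
            bucket = clusters.setdefault(key, [])
            # Keep only the best-ranked few documents per cluster.
            if len(bucket) < max_per_cluster:
                bucket.append(doc_id)
        exhausted = len(hits) < window  # the index had nothing more to give
        if len(clusters) >= min_clusters or exhausted or window >= max_window:
            return clusters
        window = min(window * growth, max_window)


if __name__ == "__main__":
    # Toy ranked list: 100 hits from one domain, then a few from two others.
    ranked = [(f"d{i}", "a.com") for i in range(100)]
    ranked += [(f"d{i}", "b.com") for i in range(100, 103)]
    ranked += [(f"d{i}", "c.com") for i in range(103, 105)]
    print(collect_clusters(lambda n: ranked[:n], min_clusters=3))
```

Growing the window geometrically keeps the number of reruns logarithmic in the depth finally needed, but each rerun still rescores the hits already seen, so this does not remove the cost being asked about.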

Is there a better way?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--

Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org
Voice: 315-443-5484
Fax: 315-443-6886
