You might be interested in the Carrot project, which has some Lucene
support. I don't know whether it solves your second problem, but it
already implements clustering and may get you to an answer for that
problem more quickly. I recently started using it for a clustering task
related to search results. I think the author of Carrot is on the user
list from time to time.
Marvin Humphrey wrote:
Greets,
I'm toying with the idea of implementing clustering of search results
based on comparison of document vectors constrained by field. For
instance, you could cluster based on "topic", or "domain", or
"content". "domain" would be easy, as it's presumably a single value
field. "content" would be much more involved.
The problem I'm trying to solve is how to return a minimum number of
clusters from a search. Say the most relevant 100 documents for a
query are all from the same domain, but you want a maximum of two
results per domain, a la Google. I don't see any alternative to
rerunning the query an indeterminate number of times until you've
accumulated sufficient clusters, because the search logic doesn't know
what cluster a document belongs to until the document vector is
retrieved.
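For concreteness, the re-query loop described above might look roughly like the sketch below. It is only an illustration under stated assumptions: `search` is a hypothetical callable (not a Lucene or Carrot API) that returns `(doc_id, domain)` pairs in rank order for a given offset and limit, and the cluster key is assumed to be a single-valued "domain" field.

```python
def collapse_by_domain(search, query, wanted_clusters, per_domain=2, page_size=100):
    """Collect ranked hits, keeping at most `per_domain` hits per domain,
    and keep fetching further pages until `wanted_clusters` distinct
    domains have been seen or the results run out.

    `search(query, offset, limit)` is a hypothetical stand-in for the
    real search call; it must return a list of (doc_id, domain) tuples.
    """
    counts = {}   # domain -> number of hits kept so far
    kept = []     # surviving hits, still in rank order
    offset = 0

    while len(counts) < wanted_clusters:
        page = search(query, offset, page_size)
        if not page:
            break  # results exhausted before enough clusters were found
        for doc_id, domain in page:
            # The domain is only known once the hit has been fetched,
            # which is why the loop cannot know in advance how far to go.
            if counts.get(domain, 0) < per_domain:
                counts[domain] = counts.get(domain, 0) + 1
                kept.append((doc_id, domain))
        offset += len(page)

    return kept
```

If the top 100 hits all share one domain, the first pass keeps only two of them and the loop has to go back for another page, which is exactly the indeterminate number of reruns in question.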
Is there a better way?
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org
Voice: 315-443-5484
Fax: 315-443-6886