Hi, thanks for the suggestion. I am relatively new to Lucene, so I have a few more questions about this implementation. I looked at the Lucene source code and found the TopDocCollector class. It appears to derive from the HitCollector class, so I should be able to extend TopDocCollector and override the Collect method to check whether a document with the same base URL has already been collected and inserted. Here are my pseudocode changes to the Collect method:

    if (score > 0.0f) {
        // Do something here to get the document base URL (doc.BaseURL)
        if ((hq.Size() < numHits || score >= minScore)
                && !collectedBaseURLArray.Contains(doc.BaseURL)) { // only sites not collected yet
            collectedBaseURLArray.Add(doc.BaseURL);
            totalHits++;
            hq.Insert(new ScoreDoc(doc, score));
            minScore = ((ScoreDoc) hq.Top()).score; // maintain minScore
        }
    }

Does this make sense? How can I tell the search to use my extended version of the TopDocCollector class? Also, how would I pull the URL from the document inside the method above? I didn't see any good documentation on how to do that. There seems to be little information out there on how to build your own custom collector.

Thanks again,

Mike

Hayri-2 wrote:
>
> Mike Polzin wrote:
>> I am working on building a web search engine and I would like to build a
>> results page similar to what Google does. The functionality I am looking
>> to include is what I refer to as "rolling up" sites, meaning that even if
>> a particular site (defined by its base URL) has many relevant hits on
>> various pages for the search's keywords, that site is only shown once in
>> the results listing, with a link to the most relevant hit on that site.
>> What I do not want is to have one site dominate a search results page.
>>
>> Does it make sense to just do the search, get the hits list and then
>> programmatically remove the results which, although they meet the search
>> criteria, are not as relevant? Is there a way to do this through queries?
>>
>> Thanks in advance!
>>
>> Mike
>>
> Actually, why don't you use another index for the search results? For
> example, you make a search and get all the results, then index those
> documents to a RAMIndex with the matching term. Then retrieve all
> documents in the RamIndex based on hits for the search terms once. In
> other words, you will filter the documents based on maximum term
> similarity. Does that make sense to you?
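
For reference, the "remove the less relevant results programmatically" option from the original question can be sketched without touching the collector at all. As far as I understand, Collect is not called in score order, so a first-seen check inside the collector may keep a weaker page for a site; the sketch below instead keeps the highest-scoring hit per base URL after the search. Everything in it is hypothetical and not part of Lucene: the SiteRollup and Hit classes, the rollUp and baseUrl methods, and the assumption that "base URL" simply means the lower-cased host name.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class SiteRollup {

    /** Hypothetical container for one search hit: its stored URL and its score. */
    static final class Hit {
        final String url;
        final float score;

        Hit(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    /**
     * Assumption: the "base URL" is the lower-cased host part of the URL.
     * URI.create throws IllegalArgumentException for malformed URLs.
     */
    static String baseUrl(String url) {
        String host = URI.create(url).getHost();
        return host == null ? url : host.toLowerCase(Locale.ROOT);
    }

    /** Keeps only the highest-scoring hit per base URL, sorted by descending score. */
    static List<Hit> rollUp(List<Hit> hits) {
        Map<String, Hit> bestPerSite = new LinkedHashMap<>();
        for (Hit hit : hits) {
            String key = baseUrl(hit.url);
            Hit best = bestPerSite.get(key);
            // Replace the stored hit only when this one scores strictly higher.
            if (best == null || hit.score > best.score) {
                bestPerSite.put(key, hit);
            }
        }
        List<Hit> result = new ArrayList<>(bestPerSite.values());
        result.sort((a, b) -> Float.compare(b.score, a.score));
        return result;
    }
}
```

Since rolling up shrinks the list, the search would probably need to request more hits than the page actually displays.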
--
View this message in context: http://old.nabble.com/Limiting-search-result-for-web-search-engine-tp27430155p27447266.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org