Hi, thanks for the suggestion. I am relatively new to Lucene, so I have a few
more questions on this implementation. I looked at the source code for
Lucene and found the TopDocCollector class. It appears this class derives
from the HitCollector class, so I should be able to extend TopDocCollector
and override the Collect method to check whether a document with the same
base URL has already been collected and inserted. Here are my pseudocode
changes to the Collect method:
    if (score > 0.0f)
    {
        // Do something here to get the document's base URL (doc.BaseURL)
        if ((hq.Size() < numHits || score >= minScore) &&
            !collectedBaseURLArray.Contains(doc.BaseURL)) // skip sites already collected
        {
            collectedBaseURLArray.Add(doc.BaseURL);
            totalHits++;
            hq.Insert(new ScoreDoc(doc, score));
            minScore = ((ScoreDoc) hq.Top()).score; // maintain minScore
        }
    }
Does this make sense?
How could I tell the search to use my extended version of the
TopDocCollector class? Also, how would I pull the URL from the document
inside of the loop above? I didn't see any good documentation anywhere on
how to do that. There seems to be little information out there on how to
build your own custom collector.
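
To make the idea concrete, here is a minimal, library-free sketch in Java of the roll-up I have in mind. It is not Lucene code: SiteRollup, Hit, baseUrl and rollUp are hypothetical names made up for illustration, and the host name is assumed to be a good enough stand-in for the "base URL". As far as I understand, a collector sees documents in index order rather than score order, so the sketch keeps the best-scoring hit per site instead of the first one seen.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; none of these names are Lucene API.
class SiteRollup {

    /** A scored search hit: the page URL plus its relevance score. */
    static final class Hit {
        final String url;
        final float score;

        Hit(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    /**
     * Returns the lower-cased host of a URL, used here as the "base URL" key.
     * Assumption: URLs are well formed; URI.create throws on malformed input.
     */
    static String baseUrl(String url) {
        String host = URI.create(url).getHost();
        return host == null ? url : host.toLowerCase(Locale.ROOT);
    }

    /** Keeps the best-scoring hit per site; returns at most numHits, best first. */
    static List<Hit> rollUp(List<Hit> hits, int numHits) {
        Map<String, Hit> bestPerSite = new LinkedHashMap<>();
        for (Hit hit : hits) {
            if (hit.score <= 0.0f) {
                continue; // same guard as the score > 0.0f check above
            }
            String key = baseUrl(hit.url);
            Hit current = bestPerSite.get(key);
            if (current == null || hit.score > current.score) {
                bestPerSite.put(key, hit); // replace a weaker hit from the same site
            }
        }
        List<Hit> result = new ArrayList<>(bestPerSite.values());
        result.sort((a, b) -> Float.compare(b.score, a.score)); // highest score first
        return result.size() > numHits
                ? new ArrayList<>(result.subList(0, numHits))
                : result;
    }
}
```

Calling SiteRollup.rollUp(hits, 10) on a list with two pages from the same host would return only the higher-scoring of the two. Inside a real collector the same map could presumably be filled from within Collect, with the base URL coming from a stored field of the document, but that part is an assumption on my side.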
Thanks again,
Mike
Hayri-2 wrote:
>
> Mike Polzin wrote:
>> I am working on building a web search engine and I would like to build a
>> results page similar to what Google does. The functionality I am looking
>> to include is what I refer to as "rolling up" sites, meaning that even if
>> a particular site (defined by its base URL) has many relevant hits on
>> various pages for the search keywords, that site is only shown once in
>> the results listing, with a link to the most relevant hit on that site.
>> What I do not want is to have one site dominate a search results page.
>>
>> Does it make sense to just do the search, get the hits list and then
>> programmatically remove the results which, although they meet the search
>> criteria, are not as relevant? Is there a way to do this through queries?
>>
>> Thanks in advance!
>>
>> Mike
>>
> Actually, why don't you use another index for the search results? For
> example, you run a search and get all the results, then index those
> documents into a RAMIndex with the matching term. Then retrieve all
> documents in the RAMIndex based on hits for the search terms once. In
> other words, you will filter the documents based on maximum term
> similarity. Does that make sense to you?
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
--
View this message in context:
http://old.nabble.com/Limiting-search-result-for-web-search-engine-tp27430155p27447266.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.