Re: Result Relevance (was: Handling Duplicates)

Michael Garski Mon, 21 May 2007 12:58:31 -0700

Here is the method I use to alter the relevancy of Lucene's search results based on other attributes of a document while keeping performance very high.

At index time, I store a value in the index that will be used to alter the score; the value is computed from several business logic rules. To improve performance at search time, during searcher warm-up I create an array the length of the document count, then walk through each document in the index, reading the stored value, parsing it into a number, and caching it in the array. In a high-volume system, the repetitive index I/O to read and parse a stored value carries a performance penalty, but now I only need to get the value out of the array using the document id of the search hit.
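The warm-up cache described above can be sketched roughly as follows. This is a library-free illustration, not Lucene code: the list `stored` stands in for walking the index and reading each document's stored field, and the names `build_boost_cache` and `default` are hypothetical.

```python
from array import array


def build_boost_cache(stored_values, default=1.0):
    """Parse each document's stored value once, at warm-up time.

    stored_values is a sequence indexed by document id; each item is the
    stored field value as a string, or None when the document has none.
    (Both the shape of the input and the default are assumptions.)
    """
    # One slot per document, pre-filled with the default value.
    cache = array("d", [default]) * len(stored_values)
    for doc_id, raw in enumerate(stored_values):
        if raw is not None:
            cache[doc_id] = float(raw)
    return cache


# Stand-in for reading the stored field of every document in the index.
stored = ["1.5", None, "0.25", "2.0"]
boosts = build_boost_cache(stored)

# At search time the lookup is a plain array index: no index I/O, no parsing.
hit_doc_id = 2
print(boosts[hit_doc_id])
```

The cache has to be rebuilt whenever a new searcher is opened, since document ids can change between index versions.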

I use a hit collector that inherits from TopDocCollector, which from my experimentation is a big boon for performance when you only need the highest-scoring results. I have a 9 million document index that, for some searches on common terms and phrases, can yield over 400,000 hits, only the first few thousand of which are all that relevant. If I try to use a normal HitCollector with that many hits, performance suffers when doing a sort to get the top results. With a collector derived from TopDocCollector, in the Collect method, call Base.Collect with your altered relevancy score and the document id. As an added bonus, the TopDocs return value is already sorted for you.
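To show why a top-N collector avoids sorting every hit, here is a minimal library-free sketch of the idea. It is not TopDocCollector itself: the class name `TopBoostedCollector` is hypothetical, and multiplying the score by a per-document value is only one assumed way of altering it (the message does not say how the score is combined).

```python
import heapq


class TopBoostedCollector:
    """Keep only the n best hits, ranked by an altered score."""

    def __init__(self, boosts, n):
        if n < 1:
            raise ValueError("n must be at least 1")
        self.boosts = boosts  # per-document values, indexed by doc id
        self.n = n
        self._heap = []       # min-heap of (adjusted_score, doc_id)

    def collect(self, doc_id, score):
        # Assumption: the altered score is score * cached value.
        entry = (score * self.boosts[doc_id], doc_id)
        if len(self._heap) < self.n:
            heapq.heappush(self._heap, entry)
        elif entry > self._heap[0]:
            # Better than the current worst kept hit: swap it in.
            heapq.heapreplace(self._heap, entry)

    def top_docs(self):
        # Only the n kept entries are sorted, never the full hit set.
        return sorted(self._heap, reverse=True)


boosts = [1.0, 3.0, 0.5, 2.0]
collector = TopBoostedCollector(boosts, n=2)
for doc_id, score in [(0, 0.9), (1, 0.2), (2, 0.8), (3, 0.4)]:
    collector.collect(doc_id, score)

print([doc_id for _, doc_id in collector.top_docs()])
```

Each hit costs at most one heap operation on a heap of size n, so 400,000 hits with n in the low thousands stays far cheaper than collecting and sorting all of them.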


Hope this can help you,

Michael

Patrick Burrows wrote:

What about physical storage order? In a traditional RDBMS (like SQL
Server) you could create a clustered index for your table which sets the
order the records are stored on disk.

I know a full-text index is not the same thing, so I don't know if there
is a similar concept or not.

Because any scheme to order the results will not be as efficient as
having the results ordered on return. Depending on the number of results,
this could be an enormous difference.



On 5/20/07, Erich Eichinger <[EMAIL PROTECTED]> wrote:


Hi all,

did anyone ever try to write a custom filter for such a task? This could
at least reduce the number of resulting indexdocs that need to be sorted.

I'm thinking of something like this:

1) fetch all dbentity keys matching a certain relevance criteria ("where
popularity > 90")
2) filter out all indexdocs where the key is not contained in the list
fetched at step 1)

of course this assumes that there is some key stored with the index to be
able to associate an indexdoc<->dbentity
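The two steps above might look like this in outline. This is a sketch only: the hits are plain dictionaries rather than Lucene documents, `popular_keys` stands in for the result of the database query, and the field name "key" and the function name `filter_hits_by_keys` are hypothetical.

```python
def filter_hits_by_keys(hits, allowed_keys):
    """Keep only hits whose stored key appears in allowed_keys."""
    allowed = set(allowed_keys)  # set membership is O(1) per hit
    return [hit for hit in hits if hit["key"] in allowed]


# Step 1: stand-in for a DB query such as "where popularity > 90".
popular_keys = [101, 205]

# Step 2: drop every indexdoc whose key is not in that list.
hits = [
    {"doc_id": 0, "key": 101, "score": 0.7},
    {"doc_id": 1, "key": 150, "score": 0.9},
    {"doc_id": 2, "key": 205, "score": 0.4},
]
filtered = filter_hits_by_keys(hits, popular_keys)
print([hit["doc_id"] for hit in filtered])
```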

just thinking out loud,
Erich


________________________________

From: Digy [mailto:[EMAIL PROTECTED]
Sent: Sun 2007-05-20 00:32
To: [email protected]
Subject: RE: Result Relevance (was: Handling Duplicates(



Hi Patrick,

I also think that doing a db query for each result can degrade the
performance dramatically. Therefore storing the relevance factor within
the index is a better idea. But then, as you say, the cost of sorting
arises. To minimize the cost, the number of hits to return can be limited
to a number (the nDocs param of the Search method of IndexSearcher). But
this time, the ranking algorithm of Lucene may skip more relevant
documents before sorting.

So, I think
       1- making a search without an "nDocs" limitation
       2- passing over the result set once and collecting the most
          relevant N results (say 100 or 1000)
       3- then sorting these results
can be a better solution.
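A small sketch of the difference between the two approaches, using made-up numbers. Each hit is a tuple of (doc id, Lucene score, stored relevance factor); the function names and the data are hypothetical, and nothing here calls Lucene.

```python
import heapq


def limit_then_sort(hits, n):
    """Cut to the n best by Lucene score first, then sort by relevance."""
    by_score = heapq.nlargest(n, hits, key=lambda h: h[1])
    return sorted(by_score, key=lambda h: h[2], reverse=True)


def top_n_by_relevance(hits, n):
    """One pass over all hits, keeping the n most relevant, sorted."""
    return heapq.nlargest(n, hits, key=lambda h: h[2])


# (doc_id, lucene_score, relevance_factor): invented values for illustration.
hits = [(0, 0.9, 10), (1, 0.8, 20), (2, 0.3, 95), (3, 0.2, 50)]

# Limiting by score first never sees documents 2 and 3.
print([h[0] for h in limit_then_sort(hits, 2)])

# Selecting over the full hit set finds the most relevant documents.
print([h[0] for h in top_n_by_relevance(hits, 2)])
```

`heapq.nlargest` returns its result already in descending order, so steps 2 and 3 collapse into one call in this sketch.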

DIGY


-----Original Message-----
From: Patrick Burrows [mailto:[EMAIL PROTECTED]
Sent: Saturday, May 19, 2007 6:34 PM
To: [email protected]
Subject: Result Relevance (was: Handling Duplicates(

Thinking about this more, I don't think doing a second DB lookup for each
result is going to scale well. It is possible that a single search returns
tens of thousands of results, and the very last one might be the most
relevant.
I am going to have to store the relevancy factors (it is more than just
popularity) within the index itself.

I think I will write something to update the relevancy rating once a week
or so for each indexed document. After all, I don't think Google updates
their PageRank more than once a month or so.

After that it is just a matter of sorting by that relevancy rating.
Though, I read on the forums that sorting is a bit of an expensive
procedure. Someone mentioned 100 searches / sec going down to 10 / sec.
Not sure of the details or the hardware. But that is an order of magnitude
difference, if those results can be believed.

Gonna experiment, I guess.


On 5/18/07, Michael Garski <[EMAIL PROTECTED]> wrote:
>
> Patrick,
>

> I've had to do something very similar, and you have a couple of options:
>
> 1. If the 'popularity' value is stored in a database, you can look up
> those values after performing your search against the index and then
> sort.
>
> 2. Continually update the index to reflect the most recent
> 'popularity' value and then perform a custom sort during your search.
>
> For my application, #2 is what we found to be most efficient.
>
> Michael
>
>
> On May 18, 2007, at 4:48 AM, Patrick Burrows wrote:
>
> > Thanks guys. I'll try it out.
> >
> > My next question is going to be about ranking the results of my
> > searches
> > based on information that is not in the index (popularity, for
> > instance,
> > which might change hourly). Is there some reading I can do on the
> > subject
> > before I start asking questions?
> >
> >
>
> --
> -
> P
