Re: Result Relevance (was: Handling Duplicates)

Michael Garski Tue, 22 May 2007 07:56:32 -0700

Correct - I create a new searcher once per hour and re-use the same instance until the next instance is created and warmed up an hour later.
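
A minimal sketch of that hourly swap, written in plain Python rather than against the Lucene API. `SearcherHolder`, `open_searcher` and `warm_up` are hypothetical names chosen for illustration, not DotLucene calls. The only point is that the new instance is fully warmed before it replaces the one serving requests.

```python
import threading


class SearcherHolder:
    """Holds the searcher currently serving requests (hypothetical helper)."""

    def __init__(self, open_searcher, warm_up):
        self._open_searcher = open_searcher  # assumed factory: returns a new searcher
        self._warm_up = warm_up              # assumed callable: fills caches for it
        self._lock = threading.Lock()
        self._current = None
        self.refresh()                       # warm the first instance up front

    def refresh(self):
        """Build and warm a new searcher, then swap it in (e.g. once per hour)."""
        fresh = self._open_searcher()
        self._warm_up(fresh)                 # the slow part happens before the swap
        with self._lock:
            self._current = fresh

    def get(self):
        """Return the warmed searcher; requests never wait on warm-up."""
        with self._lock:
            return self._current
```

A scheduler (a timer thread, for instance) would call `refresh()` each hour, while request handlers only ever call `get()`.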


Michael



On May 22, 2007, at 6:31 AM, Patrick Burrows wrote:

I think he only has to "warm up" when the web server comes online the first
time.
And maybe I am misunderstanding (I am still new to DotLucene), but a filter limits the returned results, whereas relevance refers to the order of the returned results. Both concepts may be applicable in a given search, but they don't replace one another. Even if I filter, I still want to order
(though I'd potentially be sorting a smaller subset).
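
The distinction can be sketched in a few lines of plain Python. This is only an illustration under assumed data shapes (hits as `(doc_key, score)` pairs, the filter as a set of keys), not the DotLucene Filter API.

```python
import heapq


def search(hits, allowed_keys, top_n):
    """Filter hits to the allowed keys, then return the top_n by score.

    hits: iterable of (doc_key, score) pairs (assumed shape).
    allowed_keys: set of keys that pass the filter.
    """
    # Filtering decides WHICH documents come back ...
    filtered = ((key, score) for key, score in hits if key in allowed_keys)
    # ... ordering decides in WHAT order; only the surviving subset is ranked.
    return heapq.nlargest(top_n, filtered, key=lambda hit: hit[1])


hits = [("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.7)]
print(search(hits, allowed_keys={"a", "c", "d"}, top_n=2))
# prints [('d', 0.7), ('c', 0.5)]
```

Document "b" has the best score but is removed by the filter, so the two steps answer different questions.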


On 5/22/07, Erich Eichinger <[EMAIL PROTECTED]> wrote:
> What I am doing is reading all of the stored values in the index for
> every document

I missed this one.

Didn't you mention an index size of 900MB? So you are reading this
completely into memory? Wouldn't a RAMDirectory be an easier choice then?
I suggested the filter idea since I've got a strong web/real-time
background. There's no time for a 1 minute warm-up during a web request - at
least if you want your users to return ;-). Using a filter to sort out all
irrelevant documents during search is the fastest way I can think of in this
case.

-Erich

________________________________

From: Michael Garski [mailto:[EMAIL PROTECTED]
Sent: Tue 2007-05-22 02:13
To: [email protected]
Subject: Re: Result Relevance (was: Handling Duplicates)

A filter is used to filter your search against a subset of the documents
in the index based on the results of a query.

What I am doing is reading all of the stored values in the index for
every document into an array when warming up a searcher. This is a nice
performance win that eliminates duplicate calls to reading stored values
out of the document and parsing them into integers (the unique id of the
data in an external database) when returning the results to the user
interface. It takes a minute or so to do this on warm-up, but it does
shave time off the execution of each search.
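
Roughly, the warm-up cache looks like the sketch below. It is plain Python with an in-memory list standing in for the index; `warm_up` and `to_database_ids` are hypothetical names, and reading a stored field is assumed to yield a string per document number.

```python
def warm_up(stored_ids):
    """Parse every document's stored id once, into a list indexed by doc number.

    stored_ids: list of strings, position = document number (an assumed
    stand-in for reading a stored field from the index).
    """
    cache = [0] * len(stored_ids)          # one slot per document
    for doc_num, raw in enumerate(stored_ids):
        cache[doc_num] = int(raw)          # parse once, at warm-up
    return cache


def to_database_ids(hit_doc_nums, cache):
    """Map search hits to external database ids without touching the index."""
    return [cache[doc_num] for doc_num in hit_doc_nums]


cache = warm_up(["1001", "1002", "1003", "1004"])
print(to_database_ids([2, 0], cache))
# prints [1003, 1001]
```

The cost moves from every search (a stored-field read plus a parse per hit) to a single pass at warm-up, at the price of one integer per document held in memory.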

Michael

Erich Eichinger wrote:
> Hi,
>
>
>> during searcher warm up I create an array the length of the document
>> count then walk through each document in the index reading the stored
>> value, parsing into a number, and caching in the array.
>>
>
> maybe I'm missing something: but isn't a filter nearly doing what you
> are describing here? Where is the difference - especially regarding
> performance?
>
> -Erich
>
> ________________________________
>
> From: Michael Garski [mailto:[EMAIL PROTECTED]
> Sent: Mon 2007-05-21 21:57
> To: [email protected]
> Subject: Re: Result Relevance (was: Handling Duplicates)
>
>
>
> Here is the method I use to alter the relevancy of Lucene's search
> results based on other attributes of a document, while keeping
> performance very high.
>
> At index time, I store a value in the index that will be used to alter
> the score, which is computed based on several business logic rules. To
> improve performance at search time, during searcher warm up I create an
> array the length of the document count then walk through each document
> in the index reading the stored value, parsing into a number, and
> caching in the array. In a high-volume system, the repetitive index i/o
> to read and parse a stored value has a performance penalty but now I
> only need to get the value out of the array with the document id of the
> search hit.
>
> I use a hit collector that I inherited from the TopDocCollector, which
> from my experimentation is a big boon for performance when you only need
> the highest scoring results. I have a 9 million document index that for
> some searches on common terms and phrases can yield over 400,000 hits -
> only the first few thousand of which are all that relevant and if I try
> to use a normal HitCollector with that many hits performance suffers
> when trying to do a sort to get the top results.  With a collector
> derived from TopDocCollector in the Collect method, call Base.Collect
> with your altered relevancy score and the document id.  As an added
> bonus, the TopDocs return value is already sorted for you.
>
> Hope this can help you,
>
> Michael
>
> Patrick Burrows wrote:
>
>> What about physical storage order? In a traditional RDBMS (like SQL
>> Server) you could create a clustered index for your table which sets
>> the order the records are stored on disk.
>>
>> I know a full-text index is not the same thing, so I don't know if
>> there is a similar concept or not.
>>
>> Because any scheme to order the results will not be as efficient as
>> having the results ordered on return. Depending on the number of
>> results, this could be an enormous difference.
>>
>>
>>
>> On 5/20/07, Erich Eichinger <[EMAIL PROTECTED]> wrote:
>>
>>> Hi all,
>>>
>>> did anyone ever try to write a custom filter for such a task? This
>>> could at least reduce the number of resulting indexdocs that need to
>>> be sorted.
>>>
>>> I'm thinking of something like this:
>>>
>>> 1) fetch all dbentity keys matching a certain relevance criteria
>>> ("where popularity > 90")
>>> 2) filter out all indexdocs where the key is not contained in the list
>>> fetched at step 1)
>>>
>>> of course this assumes that there is some key stored with the index
>>> to be able to associate an indexdoc<->dbentity
>>>
>>> just thinking loud,
>>> Erich
>>>
>>>
>>> ________________________________
>>>
>>> From: Digy [mailto:[EMAIL PROTECTED]
>>> Sent: Sun 2007-05-20 00:32
>>> To: [email protected]
>>> Subject: RE: Result Relevance (was: Handling Duplicates)
>>>
>>>
>>>
>>> Hi Patrick,
>>>
>>> I also think that doing a db query for each result can degrade the
>>> performance dramatically. Therefore storing the relevance factor within
>>> the index is a better idea. But then, as you say, the cost of sorting
>>> arises. To minimize the cost, the number of hits to return can be
>>> limited to a number (nDocs param of Search method of IndexSearcher).
>>> But this time, the ranking algorithm of lucene may skip out more
>>> relevant documents before sorting.
>>>
>>> So, I think
>>>        1- making a search without a "nDoc" limitation
>>>        2- Passing on the result set once and collecting the most
>>>           relevant N results (say 100 or 1000)
>>>        3- Then sorting these results
>>> can be a better solution.
>>>
>>> DIGY
>>>
>>>
>>> -----Original Message-----
>>> From: Patrick Burrows [mailto:[EMAIL PROTECTED]
>>> Sent: Saturday, May 19, 2007 6:34 PM
>>> To: [email protected]
>>> Subject: Result Relevance (was: Handling Duplicates)
>>>
>>> Thinking about this more, I don't think doing a second DB lookup for
>>> each result is going to scale well. It is possible that a single search
>>> returns tens of thousands of results, and the very last one might be
>>> the most relevant. I am going to have to store the relevancy factors
>>> (it is more than just popularity) within the index itself.
>>>
>>> I think I will write something to update the relevancy rating once a
>>> week or so for each indexed document. After all, I don't think Google
>>> updates their PageRank more than once a month or so.
>>>
>>> After that it is just a matter of sorting by that relevancy rating.
>>> Though, I read on the forums that sorting is a bit of an expensive
>>> procedure. Someone mentioned 100 searches / sec going down to 10 / sec.
>>> Not sure of the details or the hardware. But that is an order of
>>> magnitude difference, if those results can be believed.
>>>
>>> Gonna experiment, I guess.
>>>
>>>
>>> On 5/18/07, Michael Garski <[EMAIL PROTECTED]> wrote:
>>>
>>>> Patrick,
>>>>
>>>> I've had to do something very similar, and you have a couple of
>>>> options:
>>>>
>>>> 1. If the 'popularity' value is stored in a database, you can look up
>>>> those values after performing your search against the index and then
>>>> sort.
>>>>
>>>> 2. Continually update the index to reflect the most recent
>>>> 'popularity' value and then perform a custom sort during your search.
>>>>
>>>> For my application, #2 is what we found to be most efficient.
>>>>
>>>> Michael
>>>>
>>>>
>>>> On May 18, 2007, at 4:48 AM, Patrick Burrows wrote:
>>>>
>>>>
>>>>> Thanks guys. I'll try it out.
>>>>>
>>>>> My next question is going to be about ranking the results of my
>>>>> searches based on information that is not in the index (popularity,
>>>>> for instance, which might change hourly). Is there some reading I
>>>>> can do on the subject before I start asking questions?
>>>>>
>>>>>
>>>>>
>>>> --
>>>> -
>>>> P
>>>>
>>>
>>>
>>>
>>
>
>
>
>
>
--
-
P
