I see. I am going to set up some experiments...
On 5/22/07, Erich Eichinger <[EMAIL PROTECTED]> wrote:
> > And, maybe I am misunderstanding (am still new to DotLucene),
> > but a filter limits the returned results
>
> You are completely right on this. But the problem was that the
> relevance criteria are stored outside the index. Thus my suggestion
> was to query the database for, say, the top 4000 relevant document
> ids and create a filter that limits Lucene's search to only those
> 4000 documents.
>
> -Erich
>
> > -----Original Message-----
> > From: Patrick Burrows [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, May 22, 2007 3:31 PM
> > To: [email protected]
> > Subject: Re: Result Relevance (was: Handling Duplicates)
> >
> > I think he only has to "warm up" when the webserver comes online
> > for the first time.
> >
> > And, maybe I am misunderstanding (am still new to DotLucene), but
> > a filter limits the returned results, whereas relevance refers to
> > the order of the returned results. Both concepts may be applicable
> > in a given search, but they don't replace one another. Even if I
> > filter, I still want to order (though I'd, potentially, be sorting
> > a smaller subset).
> >
> > On 5/22/07, Erich Eichinger <[EMAIL PROTECTED]> wrote:
> > >
> > > > What I am doing is reading all of the stored values in the
> > > > index for every document
> > >
> > > I missed this one.
> > >
> > > Didn't you mention an index size of 900MB? So you are reading
> > > this completely into memory? Wouldn't a RAMDirectory be an
> > > easier choice then?
> > >
> > > I suggested the filter idea since I've got a strong web/realtime
> > > background. There's no time for a one-minute warmup during a web
> > > request - at least if you want your users to return ;-). Using a
> > > filter to sort out all irrelevant documents during the search is
> > > the fastest way I can think of in this case.
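[To make Erich's suggestion concrete, here is a minimal sketch. In Lucene a filter is essentially a bit set keyed by internal document id; this pure-Python stand-in (not the Lucene.Net API; every name is illustrative) shows the "fetch top ids from the database, then restrict the search to those documents" idea.]

```python
# Sketch of the suggestion: fetch the top-N relevant ids from the
# database, then restrict the search to just those documents.
# Pure-Python stand-in for a Lucene.Net Filter (a bit set keyed by
# internal document id); all names here are illustrative.

def build_filter(relevant_db_ids, key_by_doc_id):
    """Return a bit set over internal doc ids whose stored db key
    is in the relevant set."""
    allowed = set(relevant_db_ids)
    return [key in allowed for key in key_by_doc_id]

def filtered_search(hits, doc_filter):
    """Drop every hit whose doc id is not enabled in the filter."""
    return [(doc_id, score) for doc_id, score in hits if doc_filter[doc_id]]

# doc id -> db key stored in the index at indexing time
keys = [101, 102, 103, 104]
# e.g. the result of "SELECT id FROM docs WHERE popularity > 90"
top_ids = [102, 104]
f = build_filter(top_ids, keys)
print(filtered_search([(0, 1.2), (1, 0.9), (2, 0.8), (3, 0.5)], f))
# -> [(1, 0.9), (3, 0.5)]
```

The filter is built once per database query, so the per-search cost is only the bit test per candidate document.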
> > >
> > > -Erich
> >
> > ________________________________
> >
> > From: Michael Garski [mailto:[EMAIL PROTECTED]
> > Sent: Tue 2007-05-22 02:13
> > To: [email protected]
> > Subject: Re: Result Relevance (was: Handling Duplicates)
> >
> > A filter is used to filter your search against a subset of the
> > documents in the index, based on the results of a query.
> >
> > What I am doing is reading all of the stored values in the index
> > for every document into an array when warming up a searcher. This
> > is a nice performance win that eliminates duplicate calls to read
> > stored values out of the document and parse them into integers
> > (the unique id of the data in an external database) when
> > returning the results to the user interface. It takes a minute or
> > so to do this on warm-up, but it does shave time off the
> > execution of each search.
> >
> > Michael
> >
> > Erich Eichinger wrote:
> > > Hi,
> > >
> > > > during searcher warm-up I create an array the length of the
> > > > document count, then walk through each document in the index,
> > > > reading the stored value, parsing into a number, and caching
> > > > in the array.
> > >
> > > Maybe I'm missing something, but isn't a filter doing nearly
> > > what you are describing here? Where is the difference -
> > > especially regarding performance?
> > >
> > > -Erich
> > >
> > > ________________________________
> > >
> > > From: Michael Garski [mailto:[EMAIL PROTECTED]
> > > Sent: Mon 2007-05-21 21:57
> > > To: [email protected]
> > > Subject: Re: Result Relevance (was: Handling Duplicates)
> > >
> > > Here is the method I use to alter the relevancy of Lucene's
> > > search results based on other attributes of a document, while
> > > keeping performance very high.
> > >
> > > At index time, I store a value in the index that will be used
> > > to alter the score, which is computed based on several business
> > > logic rules.
> > > To improve performance at search time, during searcher warm-up
> > > I create an array the length of the document count, then walk
> > > through each document in the index, reading the stored value,
> > > parsing it into a number, and caching it in the array. In a
> > > high-volume system, the repetitive index I/O to read and parse
> > > a stored value carries a performance penalty, but now I only
> > > need to get the value out of the array using the document id of
> > > the search hit.
> > >
> > > I use a hit collector that I inherited from TopDocCollector,
> > > which from my experimentation is a big boon for performance
> > > when you only need the highest-scoring results. I have a
> > > 9 million document index where some searches on common terms
> > > and phrases can yield over 400,000 hits - only the first few
> > > thousand of which are at all relevant - and if I try to use a
> > > normal HitCollector with that many hits, performance suffers
> > > when sorting to get the top results. With a collector derived
> > > from TopDocCollector, in the Collect method call base.Collect
> > > with your altered relevancy score and the document id. As an
> > > added bonus, the TopDocs return value is already sorted for
> > > you.
> > >
> > > Hope this can help you,
> > >
> > > Michael
> > >
> > > Patrick Burrows wrote:
> > > >
> > > > What about physical storage order? In a traditional RDBMS
> > > > (like SQL Server) you could create a clustered index for your
> > > > table, which sets the order in which the records are stored
> > > > on disk.
> > > >
> > > > I know a full-text index is not the same thing, so I don't
> > > > know if there is a similar concept or not.
> > > >
> > > > Any scheme to order the results after the fact will not be as
> > > > efficient as having the results ordered on return. Depending
> > > > on the number of results, this could be an enormous
> > > > difference.
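[A minimal sketch of Michael's two tricks together - the warm-up cache array and a bounded top-N collector. This is Python standing in for the C# code; `warm_up`, `TopNCollector`, and the boost arithmetic are illustrative, not the Lucene.Net API, though the heap behavior is roughly what a TopDocCollector does.]

```python
import heapq

# At warm-up, parse each document's stored value once into an array;
# at search time, adjust the raw score with the cached value and keep
# only the top N hits in a bounded min-heap.

def warm_up(stored_values):
    """One-time pass over the index: parse each stored value to an
    int, indexed by internal document id."""
    return [int(v) for v in stored_values]

class TopNCollector:
    def __init__(self, n):
        self.n = n
        self._heap = []  # min-heap of (score, doc_id); smallest on top

    def collect(self, doc_id, score):
        if len(self._heap) < self.n:
            heapq.heappush(self._heap, (score, doc_id))
        elif score > self._heap[0][0]:
            # better than the worst retained hit: swap it in
            heapq.heapreplace(self._heap, (score, doc_id))

    def top_docs(self):
        # highest score first, like TopDocs
        return sorted(self._heap, reverse=True)

boost = warm_up(["10", "50", "30"])  # cached at warm-up, no per-search I/O
collector = TopNCollector(2)
for doc_id, raw in [(0, 1.0), (1, 1.0), (2, 1.0)]:
    collector.collect(doc_id, raw * boost[doc_id])  # altered relevancy
print(collector.top_docs())  # -> [(50.0, 1), (30.0, 2)]
```

The heap keeps memory and sort cost proportional to N rather than to the 400,000-hit result set, which is where the performance win over a plain collect-everything-then-sort approach comes from.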
> > > > On 5/20/07, Erich Eichinger <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > Did anyone ever try to write a custom filter for such a
> > > > > task? This could at least reduce the number of resulting
> > > > > index docs that need to be sorted.
> > > > >
> > > > > I'm thinking of something like this:
> > > > >
> > > > > 1) Fetch all db-entity keys matching a certain relevance
> > > > >    criterion ("where popularity > 90").
> > > > > 2) Filter out all index docs whose key is not contained in
> > > > >    the list fetched in step 1.
> > > > >
> > > > > Of course this assumes that there is some key stored with
> > > > > the index to be able to associate an index doc with a db
> > > > > entity.
> > > > >
> > > > > Just thinking out loud,
> > > > > Erich
> > > > >
> > > > > ________________________________
> > > > >
> > > > > From: Digy [mailto:[EMAIL PROTECTED]
> > > > > Sent: Sun 2007-05-20 00:32
> > > > > To: [email protected]
> > > > > Subject: RE: Result Relevance (was: Handling Duplicates)
> > > > >
> > > > > Hi Patrick,
> > > > >
> > > > > I also think that doing a db query for each result can
> > > > > degrade performance dramatically. Therefore storing the
> > > > > relevance factor within the index is a better idea. But
> > > > > then, as you say, the cost of sorting arises. To minimize
> > > > > the cost, the number of hits to return can be limited to a
> > > > > number (the nDocs parameter of the Search method of
> > > > > IndexSearcher). But this time, the ranking algorithm of
> > > > > Lucene may skip more relevant documents before sorting.
> > > > >
> > > > > So, I think a better solution is:
> > > > > 1. Making a search without an nDocs limitation.
> > > > > 2. Passing over the result set once and collecting the most
> > > > >    relevant N results (say 100 or 1000).
> > > > > 3. Then sorting those results.
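[Digy's three steps collapse into a single bounded selection over the unlimited result set. A one-function Python sketch (illustrative, not Lucene.Net code) using the standard library's heap-based selection:]

```python
import heapq

# Walk the unlimited result set once and keep the N most relevant
# hits, already sorted - O(hits * log N) instead of sorting all hits.

def best_n(hits, n):
    """hits: iterable of (relevance, doc_id); returns top n, sorted
    by relevance, highest first."""
    return heapq.nlargest(n, hits)

all_hits = [(0.2, 0), (0.9, 1), (0.5, 2), (0.7, 3), (0.1, 4)]
print(best_n(all_hits, 3))  # -> [(0.9, 1), (0.7, 3), (0.5, 2)]
```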
> > > > > DIGY
> > > > >
> > > > > -----Original Message-----
> > > > > From: Patrick Burrows [mailto:[EMAIL PROTECTED]
> > > > > Sent: Saturday, May 19, 2007 6:34 PM
> > > > > To: [email protected]
> > > > > Subject: Result Relevance (was: Handling Duplicates)
> > > > >
> > > > > Thinking about this more, I don't think doing a second DB
> > > > > lookup for each result is going to scale well. It is
> > > > > possible that a single search returns tens of thousands of
> > > > > results, and the very last one might be the most relevant.
> > > > > I am going to have to store the relevancy factors (it is
> > > > > more than just popularity) within the index itself.
> > > > >
> > > > > I think I will write something to update the relevancy
> > > > > rating once a week or so for each indexed document. After
> > > > > all, I don't think Google updates their PageRank more than
> > > > > once a month or so.
> > > > >
> > > > > After that it is just a matter of sorting by that relevancy
> > > > > rating. Though, I read on the forums that sorting is a bit
> > > > > of an expensive procedure. Someone mentioned 100
> > > > > searches/sec going down to 10/sec. Not sure of the details
> > > > > or the hardware, but that is an order of magnitude
> > > > > difference, if those results can be believed.
> > > > >
> > > > > Gonna experiment, I guess.
> > > > >
> > > > > On 5/18/07, Michael Garski <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > > Patrick,
> > > > > >
> > > > > > I've had to do something very similar, and you have a
> > > > > > couple of options:
> > > > > >
> > > > > > 1. If the 'popularity' value is stored in a database, you
> > > > > >    can look up those values after performing your search
> > > > > >    against the index and then sort.
> > > > > >
> > > > > > 2. Continually update the index to reflect the most
> > > > > >    recent 'popularity' value and then perform a custom
> > > > > >    sort during your search.
> > > > > >
> > > > > > For my application, #2 is what we found to be most
> > > > > > efficient.
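[Option 1 above - search first, then look the 'popularity' values up and re-sort - can be sketched like this. Python stand-in, with a dict playing the role of the database; all names are illustrative:]

```python
# Sketch of option 1: run the index search, then order the hits by a
# popularity value fetched from an external store. With tens of
# thousands of hits this adds a lookup per result, which is the
# scaling concern Patrick raises above.

popularity_db = {101: 7, 102: 42, 103: 19}  # illustrative data

def sort_by_popularity(hits, db):
    """hits: list of (doc_key, score); order by popularity,
    descending. Unknown keys sort last."""
    return sorted(hits, key=lambda h: db.get(h[0], 0), reverse=True)

print(sort_by_popularity([(101, 1.0), (102, 0.8), (103, 0.9)],
                         popularity_db))
# -> [(102, 0.8), (103, 0.9), (101, 1.0)]
```

Option 2 avoids the per-result lookup entirely by keeping the value inside the index, at the price of reindexing documents whenever popularity changes.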
> > > > > > Michael
> > > > > >
> > > > > > On May 18, 2007, at 4:48 AM, Patrick Burrows wrote:
> > > > > >
> > > > > > > Thanks guys. I'll try it out.
> > > > > > >
> > > > > > > My next question is going to be about ranking the
> > > > > > > results of my searches based on information that is not
> > > > > > > in the index (popularity, for instance, which might
> > > > > > > change hourly). Is there some reading I can do on the
> > > > > > > subject before I start asking questions?
> >
> > --
> > -
> > P
--
-
P
