Re: Result Relevance (was: Handling Duplicates)

Patrick Burrows Mon, 21 May 2007 14:35:28 -0700

True. I hadn't done any calculations yet. On my test dataset on my computer,
31,000 documents take up 71,641,514 bytes (about 2.3 KB per document). Scaling
that up, 10 million records will only take about 22 GB.


That's not as bad as I had feared.
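
The estimate above is a straight linear extrapolation. A minimal Python
sketch of the arithmetic follows. The function name is made up for
illustration, and it assumes index size grows linearly with document count,
which a merged or optimized index may not follow exactly.

```python
def projected_index_bytes(sample_bytes, sample_docs, target_docs):
    """Linearly scale an observed index size to a larger document count."""
    bytes_per_doc = sample_bytes / sample_docs
    return bytes_per_doc * target_docs


# Figures from the test dataset: 31,000 documents in 71,641,514 bytes.
estimate = projected_index_bytes(71_641_514, 31_000, 10_000_000)
print(f"{estimate / 1e9:.1f} GB (10^9 bytes), {estimate / 2**30:.1f} GiB (2^30 bytes)")
```

Depending on whether a GB is counted as 10^9 or 2^30 bytes, the result lands
slightly above or slightly below 22.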


On 5/21/07, Digy <[EMAIL PROTECTED]> wrote:


Hi All,

I think Michael has arrived at a good solution, and it can be extended to
store the array created at warm-up (or some part of it) on disk when
memory limitations arise.
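
To make the idea concrete, here is a rough sketch in Python (not Lucene.Net
code) of a per-document value array that is written to disk at warm-up and
then memory-mapped, so a lookup by document id does not need the whole array
in process memory. Everything named here (write_boosts, BoostFile, top_n, the
file name) is hypothetical, the on-disk layout of one 4-byte float per
document id is an assumption, and top_n only stands in for a collector that
keeps the N highest-scoring hits.

```python
import heapq
import mmap
import os
import struct
import tempfile
from array import array

FLOAT_SIZE = struct.calcsize("f")  # bytes per stored value (native float)


def write_boosts(path, boosts):
    """Warm-up step: write one float per document id, in doc-id order."""
    with open(path, "wb") as f:
        array("f", boosts).tofile(f)


class BoostFile:
    """Read-only, memory-mapped view of the value array on disk."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def __len__(self):
        return len(self._map) // FLOAT_SIZE

    def __getitem__(self, doc_id):
        if not 0 <= doc_id < len(self):
            raise IndexError(doc_id)
        return struct.unpack_from("f", self._map, doc_id * FLOAT_SIZE)[0]

    def close(self):
        self._map.close()
        self._file.close()


def top_n(hits, boosts, n):
    """Keep only the n best (adjusted_score, doc_id) pairs, best first."""
    if n <= 0:
        return []
    heap = []  # min-heap: the weakest hit kept so far sits at heap[0]
    for doc_id, raw_score in hits:
        item = (raw_score * boosts[doc_id], doc_id)
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "boosts.bin")
    write_boosts(path, [1.0, 0.5, 2.0, 1.5])
    boosts = BoostFile(path)
    hits = [(0, 1.0), (1, 4.0), (2, 0.75), (3, 0.5)]  # (doc_id, raw score)
    best = top_n(hits, boosts, 2)
    boosts.close()
print(best)
```

Multiplying the raw score by the stored value is only one possible way to
combine them; the bounded heap means the full hit list is never sorted.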

The IndexSize/document# ratio is ~100 bytes per document according to Michael
(and I have a ratio of ~1000 with 15M docs). So an index size of terabytes
means billions of documents, which is quite unusual.

DIGY.



-----Original Message-----
From: Michael Garski [mailto:[EMAIL PROTECTED]
Sent: Monday, May 21, 2007 11:54 PM
To: [email protected]
Subject: Re: Result Relevance (was: Handling Duplicates)

The data I am indexing is quite small - 9 million documents create only
a 900 MB index on disk.

I have some larger indexes that are about 8-10 GB.  I've found that the
performance of updating large indexes can be poor, not to mention the
time to optimize them suffers greatly as an optimization operation
essentially re-writes the entire index.  I perform indexing and
searching on separate machines, and prefer maintaining multiple smaller
indexes, then merging them before publishing them to the search server.
The merge operation also acts to optimize the index into a single
segment.  When my individual sub-indexes become too large to merge
together into one large index, I merge them into medium-sized
sub-indexes.  I have my own custom multi-searcher that I use to search
the sub-indexes and merge the results.
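
As an illustration only (this is not the custom multi-searcher described
above, whose internals are not given here), the result-merging step can be
sketched in Python. The sketch assumes each sub-index returns its hits already
sorted by descending score, and that document ids are local to a sub-index, so
every hit carries the name of the index it came from. The function name and
the tuple layout are made up.

```python
import heapq
from itertools import islice


def merge_top_hits(per_index_hits, n):
    """Merge per-index hit lists into one global top-n list.

    Each inner list holds (score, index_name, doc_id) tuples and must
    already be sorted by score, highest first.
    """
    merged = heapq.merge(*per_index_hits, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, n))


hits_a = [(0.9, "sub-a", 3), (0.4, "sub-a", 7)]
hits_b = [(0.8, "sub-b", 1), (0.7, "sub-b", 2)]
top_three = merge_top_hits([hits_a, hits_b], 3)
print(top_three)
```

Because the merge is lazy, only as many hits as are needed for the top n are
pulled from each sub-index's list.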

I have one index that is comprised of 8 sub-indexes of anywhere from
4-12 GB, which are in turn created from 200 smaller sub-indexes that are
merged once or twice a day.  The total size on disk is 80 GB.   I have
not done much to tune performance on this index as it's not critical -
searches against it are run in batch jobs off-line.

Michael

Patrick Burrows wrote:
> Thanks, Michael. You are, essentially, keeping a separate, in-memory index
> of the relevance results. This is a good idea.
>
> 9 million documents... how large is your index? Have you yet got to a point
> where you need to separate it across machines? I was wondering about (and
> ignoring!) future scalability concerns when my index gets to terabyte size.
>
>
> On 5/21/07, Michael Garski <[EMAIL PROTECTED]> wrote:
>>
>> Here is the method I use to alter the relevancy of Lucene's search
>> results based on other attributes of a document, while keeping
>> performance very high.
>>
>> At index time, I store a value in the index that will be used to alter
>> the score, which is computed based on several business logic rules.  To
>> improve performance at search time, during searcher warm up I create an
>> array the length of the document count then walk through each document
>> in the index reading the stored value, parsing into a number, and
>> caching in the array.  In a high-volume system, the repetitive index I/O
>> to read and parse a stored value has a performance penalty, but now I
>> only need to get the value out of the array with the document id of the
>> search hit.
>>
>> I use a hit collector derived from TopDocCollector, which from my
>> experimentation is a big boon for performance when you only need the
>> highest-scoring results.  I have a 9 million document index that for
>> some searches on common terms and phrases can yield over 400,000 hits,
>> only the first few thousand of which are all that relevant.  If I try to
>> use a normal HitCollector with that many hits, performance suffers when
>> trying to do a sort to get the top results.  With a collector derived
>> from TopDocCollector, in the Collect method, call base.Collect with your
>> altered relevancy score and the document id.  As an added bonus, the
>> TopDocs return value is already sorted for you.
>>
>> Hope this can help you,
>>
>> Michael
>>
>> Patrick Burrows wrote:
>> > What about physical storage order? In a traditional RDBMS (like SQL
>> > Server) you could create a clustered index for your table, which sets
>> > the order the records are stored on disk.
>> >
>> > I know a full-text index is not the same thing, so I don't know if
>> > there is a similar concept or not.
>> >
>> > I ask because any scheme to order the results will not be as efficient
>> > as having the results already ordered on return. Depending on the
>> > number of results, this could be an enormous difference.
>> >
>> >
>> >
>> > On 5/20/07, Erich Eichinger <[EMAIL PROTECTED]> wrote:
>> >>
>> >> Hi all,
>> >>
>> >> did anyone ever try to write a custom filter for such a task? This
>> >> could at least reduce the number of resulting indexdocs that need to
>> >> be sorted.
>> >>
>> >> I'm thinking of something like this:
>> >>
>> >> 1) fetch all dbentity keys matching a certain relevance criterion
>> >> ("where popularity > 90")
>> >> 2) filter out all indexdocs where the key is not contained in the
>> >> list fetched at step 1)
>> >>
>> >> of course this assumes that there is some key stored with the index
>> >> to be able to associate an indexdoc<->dbentity
>> >>
>> >> just thinking out loud,
>> >> Erich
>> >>
>> >>
>> >> ________________________________
>> >>
>> >> From: Digy [mailto:[EMAIL PROTECTED]
>> >> Sent: Sun 2007-05-20 00:32
>> >> To: [email protected]
>> >> Subject: RE: Result Relevance (was: Handling Duplicates)
>> >>
>> >>
>> >>
>> >> Hi Patrick,
>> >>
>> >> I also think that doing a db query for each result can degrade
>> >> performance dramatically. Therefore storing the relevance factor
>> >> within the index is a better idea. But then, as you say, the cost of
>> >> sorting arises. To minimize the cost, the number of hits to return can
>> >> be limited (nDocs param of the Search method of IndexSearcher). But
>> >> this time, the ranking algorithm of Lucene may skip more relevant
>> >> documents before sorting.
>> >>
>> >> So, I think
>> >>        1- making a search without an "nDocs" limitation
>> >>        2- passing over the result set once and collecting the most
>> >>           relevant N results (say 100 or 1000)
>> >>        3- then sorting these results
>> >> can be a better solution.
>> >>
>> >> DIGY
>> >>
>> >>
>> >> -----Original Message-----
>> >> From: Patrick Burrows [mailto:[EMAIL PROTECTED]
>> >> Sent: Saturday, May 19, 2007 6:34 PM
>> >> To: [email protected]
>> >> Subject: Result Relevance (was: Handling Duplicates)
>> >>
>> >> Thinking about this more, I don't think doing a second DB lookup for
>> >> each result is going to scale well. It is possible that a single
>> >> search returns tens of thousands of results, and the very last one
>> >> might be the most relevant. I am going to have to store the relevancy
>> >> factors (it is more than just popularity) within the index itself.
>> >>
>> >> I think I will write something to update the relevancy rating once a
>> >> week or so for each indexed document. After all, I don't think Google
>> >> updates their PageRank more than once a month or so.
>> >>
>> >> After that it is just a matter of sorting by that relevancy rating.
>> >> Though, I read on the forums that sorting is a bit of an expensive
>> >> procedure. Someone mentioned 100 searches / sec going down to 10 / sec.
>> >> Not sure of the details or the hardware. But that is an order of
>> >> magnitude difference, if those results can be believed.
>> >>
>> >> Gonna experiment, I guess.
>> >>
>> >>
>> >> On 5/18/07, Michael Garski <[EMAIL PROTECTED]> wrote:
>> >> >
>> >> > Patrick,
>> >> >
>> >> > I've had to do something very similar, and you have a couple of
>> >> > options:
>> >> >
>> >> > 1. If the 'popularity' value is stored in a database, you can look
>> >> > up those values after performing your search against the index and
>> >> > then sort.
>> >> >
>> >> > 2. Continually update the index to reflect the most recent
>> >> > 'popularity' value and then perform a custom sort during your
>> >> > search.
>> >> >
>> >> > For my application, #2 is what we found to be most efficient.
>> >> >
>> >> > Michael
>> >> >
>> >> >
>> >> > On May 18, 2007, at 4:48 AM, Patrick Burrows wrote:
>> >> >
>> >> > > Thanks guys. I'll try it out.
>> >> > >
>> >> > > My next question is going to be about ranking the results of my
>> >> > > searches based on information that is not in the index
>> >> > > (popularity, for instance, which might change hourly). Is there
>> >> > > some reading I can do on the subject before I start asking
>> >> > > questions?
>> >> > >
>> >> > >
>> >> >
>> >> > --
>> >> > -
>> >> > P
>> >>
>> >>
>> >>
>> >>
>> >
>> >
>>
>>
>
>



--
-
P
