'Sponsored' links

Daniel B. Davis Sun, 15 Feb 2004 12:43:43 -0800

I am a newbie to Lucene, and this is my first serious posting
to Lucene-user.

I would like to solicit comments on the problem of supplying
a "sponsored links" capability within Lucene. This capability
would not affect which documents a query returns, but would
cause any 'sponsored' documents present among the results to
be displayed before the other documents in the returned list.

I have looked over the correspondence in Lucene-user, but
not found anything addressing this topic; if I have missed it,
please tell me where and when, and ignore the rest of this.

It seems to me that there are three ways to achieve the
capability:

1. Preset boost values for 'sponsored' documents, with an
   implied burden of reindexing when sponsors are modified.

2. Post-qualify documents present in the hit list for their
   sponsorship status, building a new hit list.

3. Modify the query to search using both the full query as
   an unsponsored boolean clause with the default boost value,
   and for each sponsor, to repeat the full query ANDed with
   that sponsor with the appropriate boost value.

Are there other strategies not considered?

Assuming a small list of sponsors (10 or fewer) and low
volatility among the sponsors (one change per month or fewer),
which method is best?

I have been pursuing method #1, almost to the exclusion of
the others, but have encountered a difficulty in the
implementation that I do not understand (see separate posting).
As for the alternatives, while it is clear that #3 is doable,
I know nothing about the searching burden added by multiplying
the user's query by one plus the count of sponsors.

Regarding #3, if my understanding is right, then given:
   sponsor names:    s1, s2, s3 ...
   words or phrases: s1w1, s1w2, ... , s2w1, s2w2, ... , s3w1 ...
   boost values:     s1v, s2v, s3v ...

   and query q as user input, form:
            q
            or (q and (s1w1 | s1w2 | s1w3 | ...)^s1v)
            or (q and (s2w1 | s2w2 ...)^s2v)
            or (q and (s3w1 ...)^s3v)
Is this correct?
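
To make the shape of that expanded query concrete, here is a minimal
sketch that only assembles the query string. It assumes the query
parser accepts AND, OR, parenthesised groups, and a ^boost suffix on
a group; that should be checked against the Lucene version in use.
The function names and the sample sponsor data are hypothetical, and
the user's query q is inserted without any escaping.

```python
def or_group(terms):
    """Join one sponsor's words or phrases into a parenthesised OR group.

    Multi-word phrases are wrapped in double quotes.
    """
    quoted = ['"%s"' % t if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def sponsored_query(q, sponsors):
    """Build: (q) OR ((q) AND (s1w1 OR s1w2 ...)^s1v) OR ...

    q        -- the user's query text (assumed already valid query syntax)
    sponsors -- list of (terms, boost) pairs, one pair per sponsor
    """
    clauses = ["(%s)" % q]  # the unsponsored clause, default boost
    for terms, boost in sponsors:
        clauses.append("((%s) AND %s^%s)" % (q, or_group(terms), boost))
    return " OR ".join(clauses)


# Hypothetical sponsors: two for illustration only.
print(sponsored_query("apache lucene",
                      [(["acme", "acme corp"], 4.0), (["globex"], 2.0)]))
```

The string contains one copy of q plus one more per sponsor, which is
exactly the multiplication of the user's query asked about above.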

Does the search strategy build any kind of intermediate
sublist to speed up searching? (But then it would start to
resemble #2.)

Rolling one's own for #2 would run query q and get the
HitCollector, then separately run a query for each of:
            s1w1 | s1w2 | s1w3 | ...,
            s2w1 | s2w2 ...
            s3w1 ...
and merge each resulting hit collector with the one from query q.
(Just AND the bitsets???) Lastly, adjust scores and form
a new composite HitCollector.  By this time I have told
everyone much more than I know.
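
The merge step for #2 might look roughly like the sketch below. It
uses no Lucene classes at all: the hit list is assumed to be a plain
list of (doc_id, score) pairs already in score order, and each
sponsor's separate query is assumed to have produced a set of
matching doc ids. The function name and data shapes are hypothetical.
It reorders hits rather than adjusting their scores.

```python
def promote_sponsored(hits, sponsor_doc_ids):
    """Return a new hit list with sponsored documents moved to the front.

    hits            -- list of (doc_id, score), in descending score order
    sponsor_doc_ids -- dict: sponsor name -> set of doc ids matched by
                       that sponsor's own query
    Only documents already present in hits are returned, so the set of
    results is unchanged; only the order differs.
    """
    # Union of every sponsor's matches; testing a hit against this set
    # plays the role of ANDing the sponsor bitsets with the hits for q.
    sponsored = set().union(*sponsor_doc_ids.values())
    # sorted() is stable, so score order is kept inside each group.
    # False sorts before True, hence sponsored hits come first.
    return sorted(hits, key=lambda hit: hit[0] not in sponsored)


# Hypothetical data: doc 99 matches a sponsor but not the user's query.
hits = [(7, 0.9), (3, 0.8), (5, 0.4), (1, 0.2)]
print(promote_sponsored(hits, {"s1": {5, 99}, "s2": {1}}))
```

With 10 or fewer sponsors whose membership rarely changes, the
per-sponsor doc id sets are the part that looks cacheable.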

Stray thought:-- can HitCollectors be cached at application init?

There are many other questions regarding details of implementation,
but their proper place is another communication.

Just preparing this document for dissemination has helped
greatly.  Any and all comments are much appreciated.

Thank you all.
