I am a newbie to Lucene, and this is my first serious posting to Lucene-user.
This is to solicit comment upon the problem of supplying a "sponsored links" capability within Lucene. This capability would not affect at all which documents are returned by a query, but would cause any 'sponsored' documents present among the results to be displayed before other documents in the list returned.
I have looked over the correspondence in Lucene-user, but not found anything addressing this topic; if I have missed it, please tell me where and when, and ignore the rest of this.
It seems to me that there are three ways to achieve the capability:
1. Preset boost values for 'sponsored' documents, with an implied burden of reindexing when sponsors are modified.
2. Post-qualify documents present in the hit list for their sponsorship status, building a new hit list.
3. Modify the query to search using both the full query as an unsponsored boolean clause with the default boost value, and for each sponsor, to repeat the full query ANDed with that sponsor with the appropriate boost value.
Are there other strategies not considered?
Assuming a small list of sponsors (10 or fewer), and low volatility amongst the sponsors (1 change / month or less) which method is best?
I have been pursuing method #1, almost to the exclusion of the others, but have encountered an unknown difficulty in the implementation (separate posting). In particular, while it is clear that #3 is doable, I know nothing about the searching burden added by multiplying the user's query by one plus the count of sponsors.
Regarding #3, if my understanding is right, then: Sponsors name: s1, s2, s3 ... words or phrases: s1w1, s1w2, ... , s2w1, s2w2, ... , s3w1 ... boost values: s1v, s2v, s3v
then given query q as user input, form: q or (q and (s1w1 | s1w2 | s1w3 | ...)^s1v) or (q and (s2w1 | s2w2 ...)^s2v) or (q and (s3w1 ...)^s3v) Is this correct?
Does the strategy of search identify any kind of intermediate sublist to speed up searching? (But then it would start to resemble #2.)
Rolling ones own for #2 would run query q, and get the HitCollector. Separately running queries for each of: s1w1 | s1w2 | s1w3 | ..., s2w1 | s2w2 ... s3w1 ... and merge each hit collector with the one from query q. (Just AND the bitsets???) Lastly adjust scores and form a new composite HitCollecter. By this time I have told everyone much more than I know.
Stray thought:-- can HitCollectors be cached at application init?
There are many other questions regarding details of implementation, but their proper place is another communication.
Just by preparing this document for dissemination has helped greatly. All and any comments are much appreciated.
Thank you all.
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]