Re: boosting relevance of certain documents

Otis Gospodnetic Fri, 25 Apr 2008 22:03:09 -0700

If this is really about adjusting the score based on field length (I didn't
follow the thread closely), then this sounds like a job for a custom Similarity
with a custom implementation of the lengthNorm method.
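
As a rough illustration of the custom lengthNorm idea, the sketch below compares
two length-normalization curves on the two "fifa" documents from this thread.
It is plain Java with no Lucene dependency. The class and method names are
hypothetical, the term counts (4 and 6) are an assumption about what the
analyzer produces for those strings, and the lossy one-byte encoding Lucene
applies to norms is ignored. In Lucene itself the equivalent change would go
into a Similarity subclass that overrides lengthNorm.

```java
import java.util.function.IntToDoubleFunction;

// Hypothetical stand-alone model of tf * idf * lengthNorm; not Lucene code.
public class LengthNormSketch {
    // idf for "fifa", copied from the explain output quoted in this thread.
    static final double IDF = 1.287682;

    // Default-style curve: 1 / sqrt(number of terms in the field).
    static final IntToDoubleFunction SQRT_NORM = n -> 1.0 / Math.sqrt(n);
    // Steeper curve (an assumption for illustration): 1 / number of terms.
    static final IntToDoubleFunction LINEAR_NORM = n -> 1.0 / n;

    // Field weight for a single-term query under the given length norm.
    static double score(int termFreq, int numTerms, IntToDoubleFunction norm) {
        double tf = Math.sqrt(termFreq);
        return tf * IDF * norm.applyAsDouble(numTerms);
    }

    public static void main(String[] args) {
        // "fifa 08 - playstation 3": fifa once, assumed 4 terms.
        // "fifa 2003 fifa 03 - playstation 3": fifa twice, assumed 6 terms.
        System.out.printf("sqrt norm:   short=%.4f long=%.4f%n",
                score(1, 4, SQRT_NORM), score(2, 6, SQRT_NORM));
        System.out.printf("linear norm: short=%.4f long=%.4f%n",
                score(1, 4, LINEAR_NORM), score(2, 6, LINEAR_NORM));
    }
}
```

Under these assumptions the 1/sqrt curve still ranks the longer document first,
while the 1/n curve ranks "fifa 08" first. Whether such a steep curve helps or
hurts other queries would need to be tested.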


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Anshum <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Saturday, April 26, 2008 12:32:56 AM
> Subject: Re: boosting relevance of certain documents
> 
> Hi Daniel,
> 
> Just a suggestion: how about storing an extra field while indexing that holds
> the "length" of the document? You could then divide the score of the
> document by that length (or something along the same lines) while calculating
> the score, which means changing the Lucene code. That way, of two docs with
> the same score, the shorter one would get preference.
> Do you think that would solve your problem?
> Again, it involves changing the algorithm, but it would be useful if
> you have documents that keep getting updated and you cannot afford to
> hard-code the doc preference.
> 
> --
> Anshum Gupta
> 
> 
> On Sat, Apr 26, 2008 at 4:12 AM, Grant Ingersoll wrote:
> 
> > It really depends.  Hand tuning scoring algorithms for a specific query is
> > very prone to local maxima problems.  In other words, you fix one query and
> > break 50 others.  Sometimes, a good old "configurable" hard code is the way
> > to go.  If you want a specific doc to be #1, make it number one.  You will
> > pull your hair out otherwise.  In Solr, this is handled via the Query
> > Elevation Component, but it isn't all that difficult to implement yourself.
> >
> > Likewise, if you have a priori knowledge that a particular document is
> > more important, then give it a relatively large boost during indexing, being
> > aware that Lucene does not offer much granularity in terms of boosts.  In
> > other words, boost it something like 5 or 10 times, instead of 1.1 vs. 1.2.
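
The coarse granularity comes from index-time boosts being folded into the field
norm, which Lucene stores in a single byte. The sketch below is a simplified
model of that loss, not Lucene's actual encoder: it assumes the stored value
keeps only the float's sign, exponent and top two mantissa bits, and it does
not model the limited range of a byte. The class and method names are
hypothetical.

```java
// Simplified model of a lossy one-byte norm; not Lucene's real encoder.
public class NormPrecisionSketch {
    // Assumption: only sign, exponent and the top 2 mantissa bits survive.
    static float quantize(float value) {
        int bits = Float.floatToRawIntBits(value);
        return Float.intBitsToFloat(bits & 0xFFE00000);
    }

    public static void main(String[] args) {
        float[] boosts = {1.1f, 1.2f, 5.0f, 10.0f};
        for (float b : boosts) {
            System.out.println(b + " -> " + quantize(b));
        }
    }
}
```

In this model 1.1 and 1.2 both collapse to 1.0, while 5 and 10 survive
unchanged, which is why small differences in boost tend to have no effect.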
> >
> > On the other hand, if you are truly seeing broad problems, then you need
> > to build up a set of queries and judgments (à la TREC) or use the
> > contrib/benchmark Quality packages.  You might also look at Lucene's
> > Similarity class.  Lucene's length normalization is often less than optimal
> > for certain types of documents (see IBM Haifa's assessment for the
> > "Million Query" track of TREC on the Lucene Wiki).
> >
> > Cheers,
> > Grant
> >
> >
> > On Apr 25, 2008, at 3:50 PM, Daniel Freudenberger wrote:
> >
> > > > Thanks for your response. I already knew that the relevance is based on
> > > > the term frequency but in some cases it's just not what the user expects.
> > > > As I already mentioned, "fifa 2003 fifa 03" vs. "fifa 08" is such a case:
> > > > searching for "fifa" would return the "fifa 2003 fifa 03" document first
> > > > but the "fifa 08" document is more important (from the user's point of
> > > > view).
> > >
> > > Any suggestions?
> > >
> > > Best regards,
> > > Daniel
> > > -----Original Message-----
> > > From: Jonathan Ariel [mailto:[EMAIL PROTECTED]
> > > Sent: Friday, April 25, 2008 8:11 PM
> > > To: java-user@lucene.apache.org
> > > Subject: Re: boosting relevance of certain documents
> > >
> > > > OK. I'm not an expert on the scoring algorithm, but based on tf*idf you
> > > > can tell that the returned document is considered more relevant because
> > > > it has a higher term frequency.
> > >
> > > Using the explain you can see the following:
> > >
> > > Doc 1
> > > 0.643841 = (MATCH) fieldWeight(searchable:fifa in 0), product of:
> > >  1.0 = tf(termFreq(searchable:fifa)=1)
> > >  1.287682 = idf(docFreq=2)
> > >  0.5 = fieldNorm(field=searchable, doc=0)
> > >
> > > > Doc 2
> > > 0.68289655 = (MATCH) fieldWeight(searchable:fifa in 1), product of:
> > >  1.4142135 = tf(termFreq(searchable:fifa)=2)
> > >  1.287682 = idf(docFreq=2)
> > >  0.375 = fieldNorm(field=searchable, doc=1)
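
The numbers in the explain output can be reproduced by hand. The sketch below
assumes the classic default formulas, tf = sqrt(termFreq) and
idf = 1 + ln(numDocs / (docFreq + 1)), with numDocs = 4 taken from the
four-product example index. The fieldNorm values are copied from the output
rather than computed. The class and method names are hypothetical.

```java
// Hypothetical re-computation of the fieldWeight lines shown above.
public class ExplainArithmetic {
    static double tf(int termFreq) {
        return Math.sqrt(termFreq);
    }

    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // fieldWeight = tf * idf * fieldNorm
    static double fieldWeight(int termFreq, int docFreq, int numDocs, double fieldNorm) {
        return tf(termFreq) * idf(docFreq, numDocs) * fieldNorm;
    }

    public static void main(String[] args) {
        System.out.println(fieldWeight(1, 2, 4, 0.5));   // Doc 1
        System.out.println(fieldWeight(2, 2, 4, 0.375)); // Doc 2
    }
}
```

The two results agree with the 0.643841 and 0.68289655 reported by explain to
within rounding. The second occurrence of "fifa" raises tf by a factor of about
1.41, while the longer field lowers fieldNorm only from 0.5 to 0.375, so Doc 2
comes out slightly ahead.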
> > >
> > > On Fri, Apr 25, 2008 at 2:30 PM, Daniel Freudenberger <
> > > [EMAIL PROTECTED]> wrote:
> > >
> > > > > I'm using the StandardAnalyzer - hope this answers your question (I'm
> > > > > quite new to the Lucene thing)
> > > >
> > > > -----Original Message-----
> > > > From: Jonathan Ariel [mailto:[EMAIL PROTECTED]
> > > > Sent: Friday, April 25, 2008 6:59 PM
> > > > To: java-user@lucene.apache.org
> > > > Subject: Re: boosting relevance of certain documents
> > > >
> > > > How are you analyzing the searchable field?
> > > >
> > > > On Fri, Apr 25, 2008 at 12:49 PM, Daniel Freudenberger <
> > > > [EMAIL PROTECTED]> wrote:
> > > >
> > > > > > Hello,
> > > > > >
> > > > > > I'm using Lucene within a new project and I'm not sure how to solve
> > > > > > the following problem: My index consists of the two attributes "id"
> > > > > > and "searchable". "id" is the id of a product and "searchable" is a
> > > > > > combination of the product name and its category name.
> > > > > >
> > > > > > example:
> > > > > >
> > > > > > id    searchable
> > > > > > 1     fifa 08 - playstation 3
> > > > > > 2     fifa 2003 fifa 03 - playstation 3
> > > > > > 3     playstation 60gb hdd - playstation 3
> > > > > > 4     playstation i like you - playstation 3
> > > > > >
> > > > > > When searching for "fifa", Lucene returns the product with id 2
> > > > > > first, whereas id 1 ("fifa 08") would be the much more relevant
> > > > > > result (from the user's point of view). The same problem arises when
> > > > > > searching for "playstation": the customer expects products having
> > > > > > "playstation" in their names first, ideally the console itself. In
> > > > > > reality, however, he gets all possible products which are in the
> > > > > > "playstation" category as well.
> > > > > >
> > > > > > My idea was to introduce another attribute, relevance, which may
> > > > > > increase the relevance of an entry. The computed relevance shouldn't
> > > > > > be suppressed completely, though; the new attribute should only be
> > > > > > taken into account among products that are similarly relevant for a
> > > > > > specific search term.
> > > > > >
> > > > > > Does anybody have an idea on how to solve this problem?
> > > > > >
> > > > > > Thank you in advance,
> > > > > >
> > > > > > Daniel


