Re: Adding custom weights to individual terms

lukai Thu, 13 Feb 2014 14:14:11 -0800

Hi, Rune:
  For your requirement, you can generate a separate field for the document
before sending the document to Lucene. Let's say its name is score_field. The
content of this field would look like this:
Doc 1#score_field:
  Lucene:0.7 is:0 ...
Doc 2#score_field:
  Lucene:0.5 is:0 ...


 Mark this field as "indexed" and keep the other fields as "stored". Store
the weight value as a payload on each term (wrap your analyzer to consume the
weight value; basically you can combine DelimitedPayloadTokenFilter and
WhitespaceTokenizer to form a basic analyzer that accepts this input
format). Make sure each term is unique within a document's score_field
(according to your description this is already fulfilled). Keep position
information indexed for this field: Lucene stores payloads per position, so
they are only available when positions are indexed.
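
A minimal sketch, in plain Java with no Lucene dependency, of what the
score_field format carries: each whitespace-separated token is split at ':'
into a term and a weight, duplicates are rejected, and the weight is turned
into 4 bytes, which is the kind of value a term payload could hold. The class
and method names (ScoreFieldSketch, parse, encodeWeight, decodeWeight) are
hypothetical and only illustrate the idea; in Lucene the analyzer described
above would do this work.

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: parses "term:weight" tokens as used in score_field. */
final class ScoreFieldSketch {

    /** Splits on whitespace, then on the last ':' of each token. */
    static Map<String, Float> parse(String scoreField) {
        Map<String, Float> weights = new LinkedHashMap<>();
        for (String token : scoreField.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields an empty map
            }
            int sep = token.lastIndexOf(':');
            if (sep <= 0 || sep == token.length() - 1) {
                throw new IllegalArgumentException("Expected term:weight, got: " + token);
            }
            String term = token.substring(0, sep);
            float weight = Float.parseFloat(token.substring(sep + 1));
            // Each term may appear only once per document in score_field.
            if (weights.put(term, weight) != null) {
                throw new IllegalArgumentException("Duplicate term: " + term);
            }
        }
        return weights;
    }

    /** Encodes a weight as 4 big-endian bytes. */
    static byte[] encodeWeight(float weight) {
        return ByteBuffer.allocate(4).putFloat(weight).array();
    }

    /** Inverse of encodeWeight. */
    static float decodeWeight(byte[] payload) {
        return ByteBuffer.wrap(payload).getFloat();
    }

    public static void main(String[] args) {
        Map<String, Float> doc1 = parse("Lucene:0.7 is:0 powerful:0.9");
        for (Map.Entry<String, Float> e : doc1.entrySet()) {
            float roundTrip = decodeWeight(encodeWeight(e.getValue()));
            System.out.println(e.getKey() + " -> " + roundTrip);
        }
    }
}
```

Rejecting duplicate terms at parse time enforces the "unique term per
document" condition before anything reaches the index.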

Then, at query time:
1. If you want a score like the cosine similarity between query and
document, implement a query parser that parses the weights you assign
to the individual terms of the query.
2. Create a new query type, customize your score function, and tell Lucene
to use your scorer.
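
The scoring idea in the two steps above can be sketched in plain Java, with
maps standing in for the index and the parsed query. This is only an
illustration under assumptions: the class and method names
(WeightedScoreSketch, dot, cosine) are hypothetical, and in Lucene this logic
would live inside a custom query and scorer that read the weights from
payloads.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical scorer over term -> weight maps. */
final class WeightedScoreSketch {

    /** Dot product of query and document weights over shared terms. */
    static float dot(Map<String, Float> query, Map<String, Float> doc) {
        float sum = 0f;
        for (Map.Entry<String, Float> q : query.entrySet()) {
            Float d = doc.get(q.getKey());
            if (d != null) {
                sum += q.getValue() * d;
            }
        }
        return sum;
    }

    /** Cosine similarity: dot product divided by both vector lengths. */
    static float cosine(Map<String, Float> query, Map<String, Float> doc) {
        double norms = Math.sqrt(dot(query, query)) * Math.sqrt(dot(doc, doc));
        return norms == 0.0 ? 0f : (float) (dot(query, doc) / norms);
    }

    public static void main(String[] args) {
        Map<String, Float> doc1 = new HashMap<>();
        doc1.put("Lucene", 0.7f);
        doc1.put("powerful", 0.9f);

        Map<String, Float> doc2 = new HashMap<>();
        doc2.put("Lucene", 0.5f);
        doc2.put("people", 0.1f);

        // Query "Lucene" with an assumed query-side weight of 1.0.
        Map<String, Float> query = new HashMap<>();
        query.put("Lucene", 1.0f);

        System.out.println("Doc 1 dot: " + dot(query, doc1));
        System.out.println("Doc 2 dot: " + dot(query, doc2));
        System.out.println("Doc 1 cosine: " + cosine(query, doc1));
        System.out.println("Doc 2 cosine: " + cosine(query, doc2));
    }
}
```

With a single query term of weight 1.0, the plain dot product returns the
stored document weight unchanged, which matches the example where Doc 1
should score 0.7 and Doc 2 should score 0.5. The cosine variant additionally
normalizes by document length, so it ranks documents differently.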

  Here is a small snippet from a query type I created before; basically
you can follow this logic to manipulate your score value:

          final Terms terms = fields.terms(fieldName);

          if (terms != null) {
            final TermsEnum termsEnum = terms.iterator(null);
            final BytesRef bytes = new BytesRef(wandTerm.queryTerm);

            if (termsEnum.seekExact(bytes)) {
              // Upper bound of the feature value for this term.
              // NOTE: maxFeatureValue() is not part of the stock TermsEnum API.
              float ub = termsEnum.maxFeatureValue();
              int docFreq = termsEnum.docFreq();
              // logger.warn("term:" + wandTerm.queryTerm + "   :" + ub);

              DocsAndPositionsEnum docsPositionEnum =
                  termsEnum.docsAndPositions(acceptDocs, null);

              // The last argument is an IDF-like factor: (N + 1) / docFreq.
              tts.add(newWandPosting(fieldName, bytes, docsPositionEnum, ub,
                  wandTerm.featureValue, (totalDocNum + 1) * 1.0f / docFreq));
            }
          }



On Thu, Feb 13, 2014 at 10:49 AM, Rune Stilling <s...@rdfined.dk> wrote:

> I'm not sure how I would do that, when Lucene is meant to use my custom
> weights when calculating document weights when executing a search query.
>
> Doc 1
> Lucene(0.7) is(0) a(0) powerful(0.9) indexing(0.62) and(0) search(0.99)
> API(0.3)
>
> Doc 2
> Lucene(0.5) is(0) used by(0) a(0) lot of(0) smart(0) people(0.1)
>
> Query
> Lucene
>
> 0.7 and 0.5 are my custom weight and should be used to return Doc 1 with
> weight 0.7 and Doc 2 with weight 0.5 as an answer to my query.
>
> /Rune
>
> Den 13/02/2014 kl. 13.27 skrev Shai Erera <ser...@gmail.com>:
>
> > I often prefer to manage such weights outside the index. Usually managing
> > them inside the index leads to problems in the future when e.g. the
> > weights change. If they are encoded in the index, it means re-indexing.
> > Also, if the weight changes then in some segments the weight will be
> > different than others. I think that if you manage the weights e.g. in a
> > simple FST (which is very compact), it will give you the best flexibility
> > and it's very easy to use.
> >
> > Shai
> >
> >
> > On Thu, Feb 13, 2014 at 1:36 PM, Michael McCandless <
> > luc...@mikemccandless.com> wrote:
> >
> >> You could stuff your custom weights into a payload, and index that,
> >> but this is per term per document per position, while it sounds like
> >> you just want one float for each term regardless of which
> >> documents/positions where that term occurred?
> >>
> >> Doing your own custom attribute would be a challenge: not only must
> >> you create & set this attribute during indexing, but you then must
> >> change the indexing process (custom chain, custom codec) to get the
> >> new attribute into the index, and then make a custom query that can
> >> pull this attribute at search time.
> >>
> >> What are these term weights?  Are you sure you can't compute these
> >> weights at search time with a custom similarity using the stats that
> >> are already stored (docFreq, totalTermFreq, maxDoc, etc.)?
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Thu, Feb 13, 2014 at 2:40 AM, Rune Stilling <s...@rdfined.dk> wrote:
> >>> Hi list
> >>>
> > >>> I'm trying to figure out how customizable scoring and weighting is in
> > >>> the Lucene API. I read about the APIs but still can't figure out if the
> > >>> following is possible.
> >>>
> > >>> I would like to do normal document text indexing, but I would like to
> > >>> control the weight added to tokens myself. I would also like to control
> > >>> the weighting of query tokens and how things are added together.
> > >>>
> > >>> When indexing a word I would like to attach my own weights to the word,
> > >>> and use these weights when querying for documents. For example:
> >>>
> >>> Doc 1
> >>> Lucene(0.7) is(0) a(0) powerful(0.9) indexing(0.62) and(0) search(0.99)
> >> API(0.3)
> >>>
> >>> Doc 2
> >>> Lucene(0.5) is(0) used by(0) a(0) lot of(0) smart(0) people(0.1)
> >>>
> > >>> The floats in parentheses are values I would like to add in the indexing
> > >>> process, not something coming from Lucene (tf-idf, for example).
> > >>>
> > >>> When querying I would like to repeat this and also create the weights
> > >>> for each term "myself" and control how the final doc score is calculated.
> >>>
> > >>> I have read that it's possible to attach your own custom attributes to
> > >>> tokens. Is this the way to go? I.e., should I add my custom weights as
> > >>> attributes to tokens, and then access these attributes when calculating
> > >>> the document score in the search process (described at
> > >>> https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/analysis/package-summary.html
> > >>> under "adding a custom attribute")?
> >>>
> >>> The reason why I'm asking is that I can't find any examples of this
> >> being done anywhere. But I found someone stating "With Lucene, it is
> >> impossible to increase or decrease the weight of individual terms in a
> >> document".
> >>>
> >>> With regards
> >>> Rune
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >>
