Re: Index-time boosting: Deprecated setBoost method

baris . kazar Mon, 21 Oct 2019 12:16:21 -0700

Hi,-

Thanks.


Let's apply this to the following case:

QueryParser parser = new QueryParser("field1", analyzer) ;
parser.setPhraseSlop(2);
Query query = parser.parse("some string value here"+"*");
TopDocs hits = indexsearcherObject.search(query, 10);

Now I want to use BoostQuery:

QueryParser parser = new QueryParser("field1", analyzerObject) ;
parser.setPhraseSlop(2);
Query query = parser.parse("some string value here"+"*");

BoostQuery bq = new BoostQuery(query, 2.0f);

TopDocs hits = indexsearcherObject.search(bq, 10);


Now how will I handle field2 with a boost value of 1.0f?

Before, this was being done at index time.


The only way I can see here is a BooleanQuery that combines
the first BoostQuery object bq with another one, bq2, that I need to define for field2.


Is there any other way?

Best regards



On 10/21/19 2:33 PM, Uwe Schindler wrote:

Hi Boris,

That is OK, and I can see this case would be best handled with BoostQuery;
I also don't have to use the Lucene expressions jar and its dependencies.

However, I am curious how to do this kind of field-based boosting at
index time, even though I will prefer the query-time boosting methodology.

The reason why it was deprecated is exactly the problem I mentioned before: it
never did what the user expected. The boost factor given in the document's
field was multiplied into the per-document norms. Unfortunately, at the same
time, the query normalization was using query statistics and normalized the
scores. As Lucene works per field, the same normalization is done per
field, resulting in the constant factor per field disappearing. There was still
some effect of index-time boosting if different documents had different values,
but in your case they are all the same. I am not sure how your queries worked
before, but the constant boost factors per field at index time definitely did
not have the effect you were thinking of. Since the earliest version of Lucene,
boosting at query time has been the way to go to have different weights per field.

The new feature in Lucene is now that you can change the score per document 
using docvalues and apply that per document at query time. Previously this was 
also possible with Document/Field#setBoost, but the flexibility was missing 
(only multiplying and limited precision). In addition the normalization effects 
made the whole thing not reliable.
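
To illustrate the point about constant factors, here is a minimal plain-Java sketch. It is a toy model, not Lucene's scoring formula: the class name ConstantBoostDemo, the example scores, and the divide-by-maximum normalization are all assumptions, chosen only to show that a factor shared by every document changes neither the ranking nor the normalized scores.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Toy model (not Lucene's scoring): a factor shared by all documents does not affect ranking. */
final class ConstantBoostDemo {

    /** Returns document indices ordered by descending score. */
    static Integer[] rank(double[] scores) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -scores[i]));
        return order;
    }

    /** Multiplies every score by the same factor, like an identical boost on every document. */
    static double[] scale(double[] scores, double factor) {
        return Arrays.stream(scores).map(s -> s * factor).toArray();
    }

    /** Divides by the largest score; an assumed stand-in for a per-field normalization step. */
    static double[] normalize(double[] scores) {
        double max = Arrays.stream(scores).max().orElse(1.0);
        return Arrays.stream(scores).map(s -> s / max).toArray();
    }

    public static void main(String[] args) {
        double[] field1Scores = {0.4, 1.5, 0.9};      // hypothetical raw scores of three documents
        double[] boosted = scale(field1Scores, 2.0);  // the same boost of 2.0 on every document

        System.out.println(Arrays.toString(rank(field1Scores)));
        System.out.println(Arrays.toString(rank(boosted)));
        System.out.println(Arrays.toString(normalize(field1Scores)));
        System.out.println(Arrays.toString(normalize(boosted)));
    }
}
```

Both rankings come out identical, and so do the normalized scores, which is why an identical boost on every document cannot express "field1 matters more than field2".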

Uwe

Best regards


On 10/21/19 12:54 PM, Uwe Schindler wrote:

Hi,

As I said before, that is a misuse of index-time boosting. In addition, in
previous versions it did not even work correctly: because of query
normalization it was normalized away anyway. And on top of that, to change it
you have to reindex.

What you intend to do is a typical use case for query time boosting with
BoostQuery. That is explained in almost every book about search, like those
about Solr or Elasticsearch.

Most query parsers also allow you to add boost factors for fields, e.g.
SimpleQueryParser (for humans that need simple syntax without fields).
There you give a list of fields and boost factors.

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de

eMail: u...@thetaphi.de

-----Original Message-----
From: baris.ka...@oracle.com <baris.ka...@oracle.com>
Sent: Monday, October 21, 2019 6:45 PM
To: java-user@lucene.apache.org
Cc: baris.kazar <baris.ka...@oracle.com>
Subject: Re: Index-time boosting: Deprecated setBoost method

Hi,-

Thanks, and I appreciate the discussion.

Let me ask it this way; I think I gave too much info at one time:

Currently i have this:

Field f1 = new TextField("field1", "string1", Field.Store.YES);
doc.add(f1);
f1.setBoost(2.0f);

Field f2 = new TextField("field2", "string2", Field.Store.YES);
doc.add(f2);
f2.setBoost(1.0f);

But this fails with Lucene 7.7.2.

Probably it is more efficient and more flexible to fix this by using
BoostQuery.

However, what could be the fix with index time boosting? The code in my
previous post was trying to do that.

Best regards

On 10/21/19 12:34 PM, Uwe Schindler wrote:

Hi,

Sorry, I don't fully understand what you intend to do. If the boost values
per field are static and used with exactly the same value for every document,
it's not needed at index time. You can just boost the field on the query side
(e.g. using BoostQuery). Boosting every document with the same static values
is an anti-pattern; that's something better suited for the query side, as you
are more flexible there.

If you need a different boost value per document, you can save that boost
value in the index per document using a docvalues field (this consumes extra
space, of course). Then you need the ExpressionQuery on the query side. But
just because it looks like Javascript, it's not slow. The syntax is compiled to
bytecode and directly included into the query execution as a dynamic Java
class, so it's very fast.

In short:
- If you need to have a different boost factor per field name that's constant
  for all documents, apply it at query time with BoostQuery.
- If you have to boost specific documents (e.g., top selling products), index
  a numeric docvalues field per document. On the query side you can use
  different query types to modify the score of each result based on the
  docvalues field. That can be done with the expressions module (using compiled
  Javascript) or by another query in Lucene that operates on ValueSource (e.g.,
  FunctionQuery). The first one is easier to use for complex formulas.
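
The two cases in the summary above can be modelled in plain Java without any Lucene classes. This is only a sketch under stated assumptions: the class BoostModel, its method names, the example scores, and the choice to sum per-field scores and to multiply by the stored value are all hypothetical, standing in for whatever query structure and expression you actually use.

```java
import java.util.Map;

/** Plain-Java model of the two boosting cases; no Lucene classes are involved. */
final class BoostModel {

    /** Case 1: one constant weight per field name, applied at query time. */
    static double fieldWeightedScore(Map<String, Double> matchScores,
                                     Map<String, Double> fieldWeights) {
        double total = 0.0;
        for (Map.Entry<String, Double> entry : matchScores.entrySet()) {
            // Fields without an explicit weight are counted with weight 1.0.
            total += entry.getValue() * fieldWeights.getOrDefault(entry.getKey(), 1.0);
        }
        return total;
    }

    /** Case 2: a value stored with each document rescales that document's score. */
    static double documentBoostedScore(double queryScore, long storedBoost) {
        return queryScore * storedBoost;
    }

    public static void main(String[] args) {
        // Hypothetical weights and per-field match scores for a single document.
        Map<String, Double> weights = Map.of("field1", 2.0, "field2", 1.0);
        Map<String, Double> matchScores = Map.of("field1", 0.5, "field2", 0.25);

        System.out.println(fieldWeightedScore(matchScores, weights));
        System.out.println(documentBoostedScore(1.25, 3L));
    }
}
```

In case 1 nothing is stored in the index, so the weights can be changed per query without reindexing; in case 2 the stored value differs per document and only the combining function is chosen at query time.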

Uwe


-----Original Message-----
From: baris.ka...@oracle.com <baris.ka...@oracle.com>
Sent: Monday, October 21, 2019 5:17 PM
To: java-user@lucene.apache.org
Cc: baris.kazar <baris.ka...@oracle.com>
Subject: Re: Index-time boosting: Deprecated setBoost method

Hi,-

Sorry about the missing parts in the previous post; please accept my
apologies for that.

I needed to add a few more questions/corrections/additions to the
previous post:

The main question was: if the boost is a single constant value, do we need the
Javascript part below?



=== Indexing code snippet for Lucene version 6.6.0 and before ===

Document doc = new Document();

Field f1 = new TextField("field1", "string1", Field.Store.YES);
doc.add(f1);
f1.setBoost(2.0f);

Field f2 = new TextField("field2", "string2", Field.Store.YES);
doc.add(f2);
f2.setBoost(1.0f);

=== end of indexing code snippet for Lucene version 6.6.0 and before ===


This turns into the following, where the _boost1 field is associated with
field1 and the _boost2 field with field2:


In Indexing code:

=== beginning of indexing code snippet ===
Field f1 = new TextField("field1", "string1", Field.Store.YES);
doc.add(f1);

// The docvalues field name must match the name bound at query time
Field _boost1 = new NumericDocValuesField("_boost1", 2L);
doc.add(_boost1);

// If this boost value needs to be stored, a separate StoredField
// instance needs to be added as well
… (I will post this soon)

Field _boost2 = new NumericDocValuesField("_boost2", 1L);
doc.add(_boost2);

// If this boost value needs to be stored, a separate StoredField
// instance needs to be added as well
… (I will post this soon)

=== end of indexing code snippet ===


Now, in the searching code (i.e., at query time), do I still need the
FunctionScoreQuery, given that in this case the boost is just a constant
value and not a function? However, a constant value can be argued to be a
function with the same value all the time, too.


=== beginning of query time code snippet ===
Expression expr = JavascriptCompiler.compile("_boost1 + _boost2");

// SimpleBindings just maps variables to SortField instances
SimpleBindings bindings = new SimpleBindings();

// These have to be LONG type I think, since NumericDocValuesField accepts
// "long" type only, am I right? Can this be DOUBLE type?
bindings.add(new SortField("_boost1", SortField.Type.LONG));
bindings.add(new SortField("_boost2", SortField.Type.LONG));   // same question here

// create a query that matches on field1 but
// scores using expr
Query query = new FunctionScoreQuery(
        new TermQuery(new Term("field1", "term_to_look_for")),
        expr.getDoubleValuesSource(bindings));

searcher.search(query, 10);

=== end of code snippet ===


Best regards


On 10/21/19 11:05 AM, baris.ka...@oracle.com wrote:

Hi,-

I would like to ask the following to make it clearer (for me at least):

Document doc = new Document();

Field f1 = new TextField("field1", "string1", Field.Store.YES);
doc.add(f1);
f1.setBoost(2.0f);

Field f2 = new TextField("field2", "string2", Field.Store.YES);
doc.add(f2);
f2.setBoost(1.0f);


This turns into the following, where the _boost1 field is associated with
field1 and the _boost2 field with field2:


In Indexing code:

Field f1 = new TextField("field1", "string1", Field.Store.YES);
doc.add(f1);

// The docvalues field name must match the name bound at query time
Field _boost1 = new NumericDocValuesField("_boost1", 2L);
doc.add(_boost1);

// If this boost value needs to be stored, a separate StoredField
// instance needs to be added as well
… (I will post this soon)

Field _boost2 = new NumericDocValuesField("_boost2", 1L);
doc.add(_boost2);

// If this boost value needs to be stored, a separate StoredField
// instance needs to be added as well
… (I will post this soon)


Now, in the searching code (i.e., at query time), do I still need the
FunctionScoreQuery, given that in this case the boost is just a constant
value and not a function? However, a constant value can be argued to be a
function with the same value all the time, too.


// The variable in the expression must match a bound name
Expression expr = JavascriptCompiler.compile("_boost1");

// SimpleBindings just maps variables to SortField instances
SimpleBindings bindings = new SimpleBindings();
bindings.add(new SortField("_boost1", SortField.Type.LONG));

// create a query that matches on field1 but
// scores using expr
Query query = new FunctionScoreQuery(
        new TermQuery(new Term("field1", "term_to_look_for")),
        expr.getDoubleValuesSource(bindings));

searcher.search(query, 10);


So, if boost is a single constant value, do we need the Javascript
part above?

Best regards


On 10/18/19 4:07 PM, baris.ka...@oracle.com wrote:

Uwe,-

Can this
https://lucene.apache.org/core/7_7_2/expressions/org/apache/lucene/expressions/Expression.html
doc example that you also gave be extended with the NumericDocValuesField
part that needs to be done for index-time boosting, too?

I see now why you meant that this is a mixed type of boosting (i.e.,
both indexing time and search time).

I then need to include the query mentioned in this example on the
_score field (I would call it the _boost field in my case) in my
overall BooleanQuery.

I will now try to combine these together and post here for future help.

Best regards


On 10/18/19 3:18 PM, Uwe Schindler wrote:

Hi,

Read my original email! The index time values are written using
NumericDocValuesField. The expressions docs also refer to that when
the bindings are documented.

It's separate from the indexed data (TextField). Think of it like an
additional numeric field in your database table with a factor in
each row.

Uwe

On October 18, 2019 7:14:03 PM UTC, baris.ka...@oracle.com wrote:

Uwe,-

Two questions there:

I guess this is applicable to TextField, too.

And I was expecting an index writer object in the example for index time
boosting.

Best regards


On 10/18/19 2:57 PM, Uwe Schindler wrote:

Sorry, I was imprecise. It's a mix of both. The factors are stored per
document in the index (this is why I called it index time). At query
time the expression uses the index time values to fold them into the
query boost.

What's your problem with that approach?

Uwe

On October 18, 2019 6:50:40 PM UTC, baris.ka...@oracle.com wrote:

Uwe,-

Thanks. If possible, I am looking for a pure Java methodology to do the
index time boosting.

This example looks like a search time boosting example:

https://lucene.apache.org/core/7_7_2/expressions/org/apache/lucene/expressions/Expression.html

Best regards

On 10/18/19 2:31 PM, Uwe Schindler wrote:

Hi,

Is there a working example for this? Is this mentioned in the Lucene
Javadocs or any other docs so that I can look at it?

To index the docvalues, see NumericDocValuesField (it can be added to
documents like indexed or stored fields). You may have used them for
sorting already.

This methodology seems to sort of discourage using index time boosting.

Not really. Many use this all the time. It's one of the killer features of
both Solr and Elasticsearch. The problem was how Document.setBoost() worked
(it did not work correctly, see below).

The previous setBoost method call was fine and easy to use.
Did it have some performance issues, and is that why it was deprecated?

No, there were several reasons for deprecating this: setBoost was not doing
what the user had expected. Internally the boost value was just multiplied into
the document norm factor (which is internally also a docvalues field). The norm
factors are only very imprecise floats stored in a byte, so precision is poor.
If you put some values into it and the length norm was already consuming all
bits, the boosting was very coarse. It was also only multiplied in, whereas most
users want to do things like record click counts in the index and then boost,
for example, with the logarithm or some other function. If the boost is just
multiplied into the length norm you have no flexibility at all.
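
The loss of precision can be made concrete with a small plain-Java sketch. It is a toy quantizer, not Lucene's norm encoding: the class name CoarseFloat and the assumption that only three mantissa bits survive are mine, and a real single-byte encoding would also have to shorten the exponent, which this sketch leaves untouched.

```java
/** Toy quantizer: keeps sign, exponent and only the top 3 mantissa bits of a float. */
final class CoarseFloat {

    /** Truncates the mantissa to 3 bits (assumed width; illustration only). */
    static float quantize(float value) {
        int bits = Float.floatToIntBits(value);
        // 0xFFF00000 keeps 1 sign bit, 8 exponent bits and the 3 highest mantissa bits.
        return Float.intBitsToFloat(bits & 0xFFF00000);
    }

    public static void main(String[] args) {
        // Hypothetical boost values that a user might expect to be distinguishable.
        for (float boost : new float[] {2.0f, 2.1f, 2.2f, 2.25f}) {
            System.out.println(boost + " -> " + quantize(boost));
        }
    }
}
```

With so few mantissa bits, nearby boost values collapse into the same stored value, and that is before the boost has been multiplied together with the length norm.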

In addition you can have several docvalues fields and use their values in a
function (e.g. one field with click count and another one with product price).
After that you can combine click count and price (which can be modified
independently during index updates) and change the boost so that lower price
and higher click count are boosted up.
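
As a rough illustration of such a function, here is a plain-Java sketch with no Lucene classes. The class name ClickPriceBoost and the formula itself (logarithm of the click count, divided by one plus the price) are assumptions made up for this example; any formula with the same direction of effect would do.

```java
/** Toy scoring function over two per-document values; the formula is illustrative only. */
final class ClickPriceBoost {

    /** Raises the score with the logarithm of the click count and lowers it with price. */
    static double boostedScore(double textScore, long clickCount, double price) {
        double clickFactor = 1.0 + Math.log1p(clickCount); // exactly 1.0 for zero clicks
        double priceFactor = 1.0 / (1.0 + price);          // cheaper documents get a larger factor
        return textScore * clickFactor * priceFactor;
    }

    public static void main(String[] args) {
        // Hypothetical documents with the same text score of 1.0.
        System.out.println(boostedScore(1.0, 100, 9.99));   // many clicks, cheap
        System.out.println(boostedScore(1.0, 100, 99.99));  // many clicks, expensive
        System.out.println(boostedScore(1.0, 5, 9.99));     // few clicks, cheap
    }
}
```

Because click count and price live in separate per-document values, either one can be updated without touching the other, and the formula can be changed at query time without reindexing.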

This is what you can do with the expressions module. You just give it
a function.

Here is an example; the second example uses a FunctionScoreQuery
that modifies the score based on the function and the given docvalues:
https://lucene.apache.org/core/7_7_2/expressions/org/apache/lucene/expressions/Expression.html

FunctionScoreQuery usage with MultiFieldQueryParser would also be nice,
where MultiFieldQueryParser already has a boosts field to do this in its
constructor.

The boosts in the query parser are applied to fields at query time (to
have a different weight per field). Index time boosting is per document.
So you can combine both.

Maybe it is not needed with MultiFieldQueryParser.

You use MultiFieldQueryParser to adjust weights of the fields (e.g.
title versus body). The parsed query is then wrapped with an expression
that modifies the score per document according to the docvalues.

Uwe

On 10/18/19 1:28 PM, Uwe Schindler wrote:

Hi,

That's not true. You can do index time boosting, but you need to do that
using a separate field. You just index a numeric docvalues field (which may
contain a long or float value per document). Later you wrap your query with
some FunctionScoreQuery (e.g., use the Javascript function query syntax in
the expressions module). This allows you to compile a Javascript function
that calculates the final score based on the score returned by the inner query
and combines it with docvalues that were indexed per document.

Uwe


-----Original Message-----
From: baris.ka...@oracle.com <baris.ka...@oracle.com>
Sent: Friday, October 18, 2019 5:28 PM
To: java-user@lucene.apache.org
Cc: baris.ka...@oracle.com
Subject: Re: Index-time boosting: Deprecated setBoost method

It looks like index-time (field) boosting is no longer possible as of Lucene
version 7.7.2. I was already using BoostQuery at search time for boosting in
another case, and this seems to be the only boosting option now in Lucene.

Best regards


On 10/18/19 10:01 AM, baris.ka...@oracle.com wrote:

Hi,-

I saw this in the Field class docs and I am trying to figure out the
following note:

setBoost(float boost)
Deprecated.
Index-time boosts are deprecated, please index index-time scoring
factors into a doc value field and combine them with the score at
query time using eg. FunctionScoreQuery.

I appreciate this note. Is there an example about this? I wish the docs
would give a simple example to further help.

https://lucene.apache.org/core/6_6_0//core/org/apache/lucene/document/Field.html

vs

https://lucene.apache.org/core/7_7_2/core/org/apache/lucene/document/Field.html

Best regards

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org