[jira] [Commented] (OAK-3367) Boosting fields not working as expected

Chetan Mehrotra (JIRA) Tue, 08 Sep 2015 02:32:06 -0700

    [ 
https://issues.apache.org/jira/browse/OAK-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734510#comment-14734510
 ]


Chetan Mehrotra commented on OAK-3367:
--------------------------------------

Based on an internal discussion with [~tmueller] and [~teofili], we have the 
following options:

h4. Approach A - Index time boost with boost information stored in payload via 
custom tokenizer

Implementation wise this feature is similar to what is [supported in 
Elasticsearch|http://jontai.me/blog/2012/10/lucene-scoring-and-elasticsearch-_all-field/].
 

*Work Required*
# Need to provide a custom tokenizer similar to what [Elasticsearch has 
done|https://github.com/elastic/elasticsearch/issues/63]

*Pros*
# Simpler for the end user, as the user does not have to list all such fields 
in the query

*Cons*
# Index time boosting has its drawbacks. See 
[here|https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html#index-boost]
# It makes more sense with conditional indexing support. See 
[http://stackoverflow.com/a/9398823/1035417]
{quote}
Index time field boosts are a way to express things like "this document's title 
is worth twice as much as the title of most documents". Query time boosts are a 
way to express "I care about matches on this clause of my query twice as much 
as I do about matches on other clauses of my query".

Index time field boosts are worthless if you set them on every document.
{quote}
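
The point of the quote can be illustrated with a minimal sketch. It assumes, as a simplification, that an index time boost acts as a plain multiplier on a document's score for a query on that one field; real Lucene scoring folds the boost into the field norm, and all document names and scores below are made up.

```python
# Hypothetical per-document scores for a match on the title field, before any boost.
base_scores = {"doc-a": 0.8, "doc-b": 0.5, "doc-c": 0.3}


def rank(scores):
    """Return document ids ordered by score, highest first."""
    return sorted(scores, key=scores.get, reverse=True)


def apply_boost(scores, boosts):
    """Multiply each document's score by its index time boost (default 1.0)."""
    return {doc: score * boosts.get(doc, 1.0) for doc, score in scores.items()}


# Same boost on every document: every score doubles, so the order is unchanged.
uniform = apply_boost(base_scores, {doc: 2.0 for doc in base_scores})

# Boost on a single document: only that document's title counts for more.
selective = apply_boost(base_scores, {"doc-c": 4.0})

print(rank(base_scores), rank(uniform), rank(selective))
```

Under this simplification a uniform boost leaves the ranking as it was, while a boost set on selected documents (the conditional indexing case) changes it.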

h4. Approach B - Index time boost with query involving multiple clauses

Compared to the use case above, Jackrabbit 2.x has so far supported boosting in a 
different way. See [1] for details. This requires the user to phrase the query 
differently, i.e. to explicitly add OR clauses for multiple fields.

bq. Note: The boost in this case is respected only if a jcr:contains() is done 
on the corresponding property, for example jcr:contains(@jcr:title, 'find 
this'). If there is only a jcr:contains(., 'find this'),  the boosts at 
indexing time have no effect.

{code}
/jcr:root/content/geometrixx-outdoors/en//element(*, cq:Page)
   [
       jcr:contains(@jcr:title, 'Keyword')
       OR jcr:contains(@jcr:description, 'Keyword')
       OR jcr:contains(., 'Keyword')
   ] order by @jcr:score descending
{code}

*Work Required*
# Boosting would need to be done on a per-field basis and can be applied at 
query time (suggested by Tommaso). {{LucenePropertyIndex}} can check whether a 
query clause targets a specific field and then boost that clause based on the 
property definition.
# Alternatively, we fix the editor to create the field with the boost level set.
# In addition, we would need to ensure that when results for multiple OR clauses 
are combined, they are merge sorted by jcr:score (OAK-2944 would need to be 
merged to the branches for this).
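
The merge sort by jcr:score can be sketched as follows. This is only an illustration and not the OAK-2944 implementation: it assumes each OR clause yields its results already sorted by score in descending order, and that a node matched by several clauses should appear once, at its highest score. The paths and scores are made up.

```python
import heapq


def merge_by_score(*result_lists):
    """Merge per-clause results, each already sorted by score descending.

    Each result is a (path, score) tuple. A path matched by several
    clauses is yielded once, at the position of its highest score.
    """
    merged = heapq.merge(*result_lists, key=lambda r: r[1], reverse=True)
    seen = set()
    for path, score in merged:
        if path not in seen:
            seen.add(path)
            yield path, score


# Hypothetical results of the three OR clauses of the query above.
title_hits = [("/content/en/a", 2.0), ("/content/en/b", 1.2)]
description_hits = [("/content/en/c", 1.5), ("/content/en/a", 0.9)]
fulltext_hits = [("/content/en/b", 0.7), ("/content/en/d", 0.4)]

results = list(merge_by_score(title_hits, description_hits, fulltext_hits))
print(results)
```

Because the merge is lazy, the combined result can be consumed in score order without first loading every clause's results into memory.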

*Pros*
# Boost logic is more apparent and can be changed without requiring reindexing
# Behavior compatible with JR2

*Cons*
# Queries need to be written in the way explained above

h4. Approach C - Query time boost with query expanded by LucenePropertyIndex

* The user would still specify the normal query, i.e. just search on the node.
* On the index config side, the user would mark each field that needs a special 
boost with {{analyzed}} and {{nodeScopeIndex}} set to true and a boost specified. 
However, we would not add the boost at indexing time yet!
* On the query side, {{LucenePropertyIndex}} would translate the search on the 
node into multiple OR clauses, with a TermQuery for every configured field that 
has {{analyzed}} and {{nodeScopeIndex}} set to true, in addition to the TermQuery 
on the node-level fulltext field.

This approach combines the best of both approaches above: it uses query time 
boosting and does not require the user to rephrase the query!
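
The query expansion in Approach C could look roughly like the sketch below. It models the idea in plain Python rather than using the Lucene API; {{PropertyDefinition}}, {{Clause}}, {{expand_node_query}} and the {{fulltext}} field name are hypothetical names chosen for illustration, not the actual Oak types.

```python
from dataclasses import dataclass

# Placeholder name for the node-level fulltext field (an assumption).
FULLTEXT_FIELD = "fulltext"


@dataclass(frozen=True)
class PropertyDefinition:
    """Hypothetical stand-in for a property entry in the index config."""
    name: str
    analyzed: bool = False
    node_scope_index: bool = False
    boost: float = 1.0


@dataclass(frozen=True)
class Clause:
    """One OR clause of the expanded query: a term match on a field."""
    field: str
    term: str
    boost: float = 1.0


def expand_node_query(term, definitions):
    """Expand a node-level search into OR clauses, one per eligible field."""
    clauses = [
        Clause(d.name, term, d.boost)
        for d in definitions
        if d.analyzed and d.node_scope_index
    ]
    # Always keep the unboosted match on the node-level fulltext field.
    clauses.append(Clause(FULLTEXT_FIELD, term))
    return clauses


definitions = [
    PropertyDefinition("jcr:title", analyzed=True, node_scope_index=True, boost=2.0),
    PropertyDefinition("jcr:description", analyzed=True, node_scope_index=True, boost=1.5),
    PropertyDefinition("jcr:lastModified"),  # not analyzed, so not expanded
]

clauses = expand_node_query("keyword", definitions)
for clause in clauses:
    print(clause)
```

Since the boost is read from the index definition at query time, changing it in this model would not require reindexing.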

[1] https://helpx.adobe.com/experience-manager/kb/BoostInSearch.html


> Boosting fields not working as expected
> ---------------------------------------
>
>                 Key: OAK-3367
>                 URL: https://issues.apache.org/jira/browse/OAK-3367
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: lucene
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>             Fix For: 1.3.6
>
>
> When the boost support was added the intention was to support a usecase like 
> {quote}
> For the fulltext search on a node where the fulltext content is derived from 
> multiple fields, it should be possible to boost specific text contributed by 
> an individual field. Meaning that if a title field is boosted more than 
> description, the title (part) in the fulltext field will mean more than the 
> description (part) in the fulltext field.
> {quote}
> This would enable a user to perform a search like 
> _/jcr:root/content/geometrixx-outdoors/en//element(*, 
> cq:Page)\[jcr:contains(., 'Keyword')\]_ and get a result where pages having 
> 'Keyword' in title come above in search result compared to those where 
> Keyword is found in description.
> The current implementation just sets the boost while adding the field value to 
> the fulltext field, with the intention that Lucene would use the boost as 
> explained above. However, it does not work like that: the boost value gets 
> multiplied with those of the other fields, and hence boosting does not work 
> as expected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
