Multiple languages, boosting and, stemming and KeywordRepeat

Markus Jelsma Mon, 14 May 2018 02:53:13 -0700

Hello,

First, apologies for the weird subject line, and apologies for cross-posting, 
but last week it got no replies on the Solr user mailing list.


We index many languages and search over all those languages at once, but boost 
the language of the user's preference. To differentiate between stemmed tokens 
and unstemmed tokens we use KeywordRepeat and RemoveDuplicates, which works very 
well.

However, we just stumbled over the following example: q=australia is not 
stemmed in English, but its suffix is removed by the Romanian stemmer. This 
causes the Romanian results to be returned on top of the English results, 
despite language boosting.

This is because the Romanian part of the query consists of the stemmed and 
unstemmed version of the word, but the English part of the query is just one 
clause per field (title, content etc). Thus the Romanian results score roughly 
twice as high as the English results.
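
To make the clause counts concrete, here is a minimal Python sketch of the 
analysis chain. It is a toy model, not Lucene code: analyze, english_stem and 
romanian_stem are hypothetical names, and the Romanian stem used here (dropping 
a trailing "a") is only a stand-in, since the real stem is not shown in this 
mail.

```python
def analyze(token, stem, remove_duplicates=True):
    """Toy stand-in for a KeywordRepeat -> stemmer -> RemoveDuplicates chain."""
    # KeywordRepeat: emit the token twice; the first copy is protected from stemming.
    tokens = [token, stem(token)]
    if remove_duplicates:
        # RemoveDuplicates: drop a token identical to one already seen at this position.
        unique = []
        for t in tokens:
            if t not in unique:
                unique.append(t)
        tokens = unique
    return tokens


def english_stem(token):
    # Per the example above, the English stemmer leaves "australia" unchanged.
    return token


def romanian_stem(token):
    # Hypothetical stand-in: strips a trailing "a". The real Romanian stem is not shown.
    return token[:-1] if token.endswith("a") else token


for name, stem in (("en", english_stem), ("ro", romanian_stem)):
    terms = analyze("australia", stem)
    print(name, terms, "clauses per field:", len(terms))
```

In this model English yields one term per field and Romanian two. Calling 
analyze("australia", english_stem, remove_duplicates=False) keeps both 
identical English terms, which corresponds to dropping the RemoveDuplicates 
filter.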

Now, this is of course really obvious, but the 'solution' is not. To work 
around the problem I removed the RemoveDuplicates filter so I get two clauses 
for English as well; really ugly, but it works. What I don't understand is the 
debug output: it doesn't list two identical clauses. Instead, it doubled the 
boost on the field, so instead of:

    27.048403 = PayloadSpanQuery, product of:
      27.048403 = weight(title_en:australia in 15850) [SchemaSimilarity], result of:
        27.048403 = score(doc=15850,freq=4.0 = phraseFreq=4.0
), product of:
          7.4 = boost
          3.084852 = idf(docFreq=14539, docCount=317894)
          1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
            4.0 = phraseFreq=4.0
            0.3 = parameter k1
            0.5 = parameter b
            15.08689 = avgFieldLength
            24.0 = fieldLength
      1.0 = AveragePayloadFunction.docScore()

I now get:

    54.096806 = PayloadSpanQuery, product of:
      54.096806 = weight(title_en:australia in 15850) [SchemaSimilarity], result of:
        54.096806 = score(doc=15850,freq=4.0 = phraseFreq=4.0
), product of:
          14.8 = boost
          3.084852 = idf(docFreq=14539, docCount=317894)
          1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
            4.0 = phraseFreq=4.0
            0.3 = parameter k1
            0.5 = parameter b
            15.08689 = avgFieldLength
            24.0 = fieldLength
      1.0 = AveragePayloadFunction.docScore()

So instead of the two clauses I expected in the debug output, I get one clause 
with a doubled boost.
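
As a sanity check on the numbers, the following Python sketch recomputes both 
scores from the values in the debug output, using the tfNorm formula printed 
there. The function names are hypothetical and this is plain arithmetic, not 
Lucene code. It only shows that two identical clauses with boost 7.4 sum to the 
same score as one clause with boost 14.8; it says nothing about how Lucene 
arrives at the single clause internally.

```python
def tf_norm(freq, k1, b, field_length, avg_field_length):
    # Formula as printed in the debug output above.
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * field_length / avg_field_length))


def clause_score(boost, idf, tfn):
    # The debug output describes the score as a product of boost, idf and tfNorm.
    return boost * idf * tfn


IDF = 3.084852  # idf(docFreq=14539, docCount=317894), taken from the debug output
tfn = tf_norm(freq=4.0, k1=0.3, b=0.5, field_length=24.0, avg_field_length=15.08689)

single = clause_score(7.4, IDF, tfn)          # one clause, original boost
two_clauses = single + single                 # two identical clauses summed
doubled_boost = clause_score(14.8, IDF, tfn)  # one clause, doubled boost

print("tfNorm:", tfn)
print("single clause:", single)
print("two clauses summed:", two_clauses)
print("one clause, boost 14.8:", doubled_boost)
```

The values agree with the debug output up to float rounding, so the total score 
is the same either way; only the way the explanation presents it differs.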

The question is: is this the intended behaviour?

Also, are there any real solutions to this problem? Removing the 
RemoveDuplicates filter looks really silly.

Many thanks!
Markus

