Re: how to modify term frequency formula?

Ivan Brusic Tue, 25 Mar 2014 09:05:22 -0700

Did you see Britta's slides? She has a slide called "Cosine similarity as
script" which mimics the Lucene scoring as a script. You can replace the
call to _index[field][word].tf() with your own implementation. You can
deploy the script as a native Java script (note: not Javascript) for
performance.


I find it easier to understand to just change the Similarity. Simply extend
DefaultSimilarity, override "public float tf(float freq)", and then
reference this similarity in your field mapping.
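
For illustration only, here is a minimal sketch of the capped formula Patrick
asked about, min(10, sqrt(n)), written in plain Python rather than against
Lucene's Java API. The function names and the default cap of 10 are
assumptions made up for this example; in a real Similarity subclass the same
arithmetic would go inside the overridden tf method.

```python
import math


def default_tf(freq: float) -> float:
    """Square root of the raw term frequency, as described in this thread."""
    return math.sqrt(freq)


def capped_tf(freq: float, max_tf: float = 10.0) -> float:
    """Hypothetical capped variant: min(max_tf, sqrt(freq))."""
    return min(max_tf, math.sqrt(freq))


if __name__ == "__main__":
    # Compare both formulas for a few raw frequencies.
    for freq in (1, 4, 100, 400, 10000):
        print(freq, default_tf(freq), capped_tf(freq))
```

With a cap of 10, any term that occurs 100 times or more in a document
contributes the same tf, so heavily repeated terms stop inflating the score.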

-- 
Ivan


On Tue, Mar 25, 2014 at 6:57 AM, geantbrun <agin.patr...@gmail.com> wrote:

> Thanks again for the answer Ivan. Would it be simpler to change the way tf
> is calculated directly in the source code? I mean replacing somewhere
> something like tf = sqrt(n) with tf = min(10,sqrt(n)).
> Cheers,
> Patrick
>
> On Friday, March 21, 2014 18:01:51 UTC-4, Ivan Brusic wrote:
>>
>> Term frequencies are stored within Lucene, so there is no calculating of
>> the value, just a lookup in the data structure. You can disable term
>> frequencies and then create your own in the script, but it would be easier
>> to calculate that value at index time so that you can access it within your
>> custom score and not have to iterate through all the terms yourself. Britta
>> has posted on the mailing list in the past, so hopefully she will reply
>> with some more authoritative answers, especially ones regarding performance.
>>
>> --
>> Ivan
>>
>>
>> On Fri, Mar 21, 2014 at 11:54 AM, geantbrun <agin.p...@gmail.com> wrote:
>>
>>> Thanks a lot Ivan, great answer.
>>>
>>> Suppose I use in my script my own formula for tf (with
>>> _index[field][term].tf()) and set the boost_mode to "replace", does
>>> elasticsearch calculate the tf two times or once only? In other words, is
>>> it computationally efficient to calculate my own tf? Should I turn off other
>>> calculations made by es somewhere else to avoid double calculations?
>>>
>>> Cheers,
>>> Patrick
>>>
>>> On Thursday, March 20, 2014 17:44:53 UTC-4, Ivan Brusic wrote:
>>>>
>>>> You can provide your own similarity to be used at the field level, but
>>>> recent versions of elasticsearch allow you to access the tf-idf values in
>>>> order to do custom scoring [1]. Also look at Britta's recent talk on the
>>>> subject [2].
>>>>
>>>> That said, either your custom similarity or custom scoring would need to
>>>> know exactly which terms are repeated many times. Have you looked into
>>>> omitting term frequencies? It would completely bypass using term
>>>> frequencies, which might be overkill in your case. Look into the index
>>>> options [3].
>>>>
>>>> Finally, perhaps the common terms query can help [4].
>>>>
>>>> [1] http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-advanced-scripting.html
>>>>
>>>> [2] https://speakerdeck.com/elasticsearch/scoring-for-human-beings
>>>>
>>>> [3] http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-core-types.html#string
>>>>
>>>> [4] http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-common-terms-query.html
>>>>
>>>> Cheers,
>>>>
>>>> Ivan
>>>>
>>>>
>>>> On Thu, Mar 20, 2014 at 8:08 AM, geantbrun <agin.p...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> If I understand correctly, the formula used for the term frequency part in
>>>>> the default similarity module is the square root of the actual frequency.
>>>>> Is it possible to modify that formula to include something like
>>>>> min(my_max_value,sqrt(frequency))? I would like to avoid huge tf's
>>>>> for documents that have the same term repeated many times. It seems that
>>>>> BM25 similarity has a parameter to control saturation, but I would prefer
>>>>> to stick with the simple tf/idf similarity module.
>>>>> Thank you for your help
>>>>> Patrick
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "elasticsearch" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to elasticsearc...@googlegroups.com.
>>>>>
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/elasticsearch/9a12b611-d08d-41f9-8fd4-b74ad75a6a5c%40googlegroups.com.
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>

