Did you see Britta's slides? She has a slide called "Cosine similarity as script", which reproduces the Lucene scoring as a script. You can replace the call to _index[field][word].tf() with your own implementation. You can deploy the script as a native Java script (note: not JavaScript) for performance.
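For reference, here is a minimal sketch of the capped formula asked about in this thread, tf = min(10, sqrt(n)), written as plain Java with no dependency on Lucene or Elasticsearch. The class name CappedTf, the constant MAX_TF, and the method names are hypothetical, chosen only for illustration; the cap of 10 is the example value from the question. The same expression could serve as the body of a script or of a tf override.

```java
public class CappedTf {
    // Hypothetical cap on the tf factor; 10 is the example value from the thread.
    public static final float MAX_TF = 10f;

    // The default tf factor described in the thread: the square root of the raw frequency.
    public static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Capped variant: min(MAX_TF, sqrt(freq)), so heavily repeated terms stop gaining weight.
    public static float cappedTf(float freq) {
        return Math.min(MAX_TF, (float) Math.sqrt(freq));
    }

    public static void main(String[] args) {
        float[] freqs = {1f, 4f, 100f, 10000f};
        for (float f : freqs) {
            System.out.println("freq=" + f
                    + " default=" + defaultTf(f)
                    + " capped=" + cappedTf(f));
        }
    }
}
```

With this cap, any term occurring more than 100 times in a field contributes the same tf factor as one occurring exactly 100 times.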
I find it easier to just change the Similarity. Simply extend DefaultSimilarity, override "public float tf(float freq)", and then reference this similarity in your field mapping.

--
Ivan

On Tue, Mar 25, 2014 at 6:57 AM, geantbrun <agin.patr...@gmail.com> wrote:

> Thanks again for the answer Ivan. Would it be simpler to modify the way
> tf is calculated directly in the source code? I mean replacing somewhere
> something like tf = sqrt(n) by tf = min(10,sqrt(n)).
> Cheers,
> Patrick
>
> On Friday, March 21, 2014 18:01:51 UTC-4, Ivan Brusic wrote:
>>
>> Term frequencies are stored within Lucene, so there is no calculating of
>> the value, just a lookup in the data structure. You can disable term
>> frequencies and then create your own in the script, but it would be easier
>> to calculate that value at index time so that you can access it within your
>> custom score and not have to iterate through all the terms yourself. Britta
>> has posted on the mailing list in the past, so hopefully she will reply
>> with some more authoritative answers, especially ones regarding performance.
>>
>> --
>> Ivan
>>
>>
>> On Fri, Mar 21, 2014 at 11:54 AM, geantbrun <agin.p...@gmail.com> wrote:
>>
>>> Thanks a lot Ivan, great answer.
>>>
>>> Suppose I use in my script my own formula for tf (with
>>> _index[field][term].tf()) and set the boost_mode to "replace", does
>>> elasticsearch calculate the tf two times or once only? In other words, is
>>> it computationally efficient to calculate my own tf? Should I turn off other
>>> calculations made by es somewhere else to avoid double calculations?
>>>
>>> Cheers,
>>> Patrick
>>>
>>> On Thursday, March 20, 2014 17:44:53 UTC-4, Ivan Brusic wrote:
>>>>
>>>> You can provide your own similarity to be used at the field level, but
>>>> recent versions of elasticsearch allow you to access the tf-idf values in
>>>> order to do custom scoring [1]. Also look at Britta's recent talk on the
>>>> subject [2].
>>>>
>>>> That said, either your custom similarity or custom scoring would need
>>>> to know exactly which terms are repeated many times. Have
>>>> you looked into omitting term frequencies? It would completely bypass using
>>>> term frequencies, which might be overkill in your case. Look into the
>>>> index options [3].
>>>>
>>>> Finally, perhaps the common terms query can help [4].
>>>>
>>>> [1] http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-advanced-scripting.html
>>>>
>>>> [2] https://speakerdeck.com/elasticsearch/scoring-for-human-beings
>>>>
>>>> [3] http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-core-types.html#string
>>>>
>>>> [4] http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-common-terms-query.html
>>>>
>>>> Cheers,
>>>>
>>>> Ivan
>>>>
>>>>
>>>> On Thu, Mar 20, 2014 at 8:08 AM, geantbrun <agin.p...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> If I understand correctly, the formula used for the term frequency part in
>>>>> the default similarity module is the square root of the actual frequency.
>>>>> Is it possible to modify that formula to include something like
>>>>> min(my_max_value,sqrt(frequency))? I would like to avoid huge tf's
>>>>> for documents that have the same term repeated many times. It seems that
>>>>> BM25 similarity has a parameter to control saturation but I would prefer to
>>>>> stick with the simple tf/idf similarity module.
>>>>> Thank you for your help
>>>>> Patrick
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "elasticsearch" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to elasticsearc...@googlegroups.com.
>>>>>
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/elasticsearch/9a12b611-d08d-41f9-8fd4-b74ad75a6a5c%40googlegroups.com
>>>>> For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CALY%3DcQC-%2B6rjUzw7k6VeT58_8RoEFg4YEY68g443VZTTQxAPzw%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.