Because we're using BM25, I think this is less of a concern in general (there's a good chart in http://www.elastic.co/guide/en/elasticsearch/guide/master/pluggable-similarites.html). We also disable norms on title fields (http://stackoverflow.com/questions/20222652/elasticsearch-when-to-set-omit-norms-option-as-false), FWIW.
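For anyone wanting to try the same setup, here is a minimal sketch of a title field mapping with BM25 and norms disabled, written as a Python dict. It assumes the mapping syntax of recent Elasticsearch releases (`text` fields, `"norms": False`); older releases spell these settings differently, so check the docs for your version. The function name and the `title` field name are hypothetical.

```python
import json


def title_field_mapping(field="title"):
    """Build a mapping body for a title field (hypothetical helper).

    Assumes recent Elasticsearch mapping syntax; not taken from the thread.
    """
    return {
        "properties": {
            field: {
                "type": "text",
                # BM25 is the similarity discussed in the thread.
                "similarity": "BM25",
                # Disable norms so field length does not affect scoring.
                "norms": False,
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(title_field_mapping(), indent=2))
```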
Thanks for the link - good info. I'm leaning toward something like what you recommend with your keepWordFilter, but doing it at query time instead of index time. It doesn't seem like I need to use the memory to store both "Socrates and Plato on Metaphysics" and "Socrates Plato Metaphysics". It seems better to make the distinction at query time, and the performance should be the same because I need two search clauses anyway.

On Monday, April 13, 2015 at 12:15:14 AM UTC+3, Doug Turnbull wrote:
>
> Yehosef, this sounds very similar to some title search work I've done.
> Title fields are odd because TF is often meaningless, and IDF can also
> be quite skewed. If only a few titles have "how" in the text, then you'll
> get very odd results.
>
> Read more here:
>
> http://opensourceconnections.com/blog/2014/12/08/title-search-when-relevancy-is-only-skin-deep/
>
> On Sunday, April 12, 2015, Yehosef Shapiro <[email protected]> wrote:
>
>> Often people using our search type "how to <something>", e.g. "how to
>> paint my kitchen". This might result in results for "tips to paint my
>> kitchen" or "how to paint my bathroom". The phrase "how to" is a generic
>> phrase and I would like to minimize its significance. I don't want to
>> remove it completely because I still would like a post called "how to paint
>> my kitchen cabinets" to match higher than "should I wallpaper or paint my
>> kitchen".
>>
>> I don't want it to be a stopword because it still has value (as in the
>> example).
>>
>> The Common Terms query might work - but I don't necessarily want to apply
>> the rules to all other common phrases (it might be a good idea - but this
>> is a specific common search term that I know people search for and I would
>> like to solve it specifically for this case if possible.)
>>
>> I don't think the negative boost is what I want because I don't want
>> those documents to get penalized for containing the words "how to" - just
>> that they should get a much smaller boost.
>>
>> Any suggestions how to approach this? For the record, I'm using the BM25
>> similarity algorithm.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/elasticsearch/acd86fb2-ae69-40be-a772-c65d008f2415%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>
> --
> Doug Turnbull | Search Relevance Consultant | OpenSource Connections, LLC
> http://www.opensourceconnections.com
> Author: Taming Search <http://manning.com/turnbull> from Manning
> Publications
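To make the query-time idea from the reply above concrete, here is a rough sketch of one way to read "two search clauses": the core terms go in a required `match` clause, and the generic phrase ("how to") goes in an optional, low-boost `match_phrase` clause, so titles containing it get a small bump rather than a penalty. The helper names, the `title` field, the phrase list, and the boost values are all assumptions for illustration, not something from the thread. The `bool`, `match`, and `match_phrase` queries are standard Elasticsearch query DSL.

```python
import json

# Hypothetical list of generic phrases to down-weight at query time.
GENERIC_PHRASES = ("how to",)


def split_generic_phrase(query, phrases=GENERIC_PHRASES):
    """Split a leading generic phrase off the query.

    Returns (phrase, remainder); phrase is None when nothing matched.
    """
    text = " ".join(query.split())  # normalize whitespace
    lowered = text.lower()
    for phrase in phrases:
        if lowered.startswith(phrase + " "):
            return phrase, text[len(phrase) + 1:]
    return None, text


def build_title_query(user_query, field="title", generic_boost=0.2):
    """Build a bool query: core terms required, generic phrase optional."""
    phrase, core = split_generic_phrase(user_query)
    bool_query = {
        # Documents must match the meaningful part of the query.
        "must": [{"match": {field: {"query": core}}}]
    }
    if phrase is not None:
        # Optional clause: a small score bump when the title also
        # contains the generic phrase. The boost value is a guess to tune.
        bool_query["should"] = [
            {"match_phrase": {field: {"query": phrase, "boost": generic_boost}}}
        ]
    return {"query": {"bool": bool_query}}


if __name__ == "__main__":
    print(json.dumps(build_title_query("how to paint my kitchen"), indent=2))
```

Because the phrase clause sits in `should` next to a `must` clause, it only affects scoring and never decides whether a document matches, which fits the "smaller boost, no penalty" requirement from the original question. Nothing extra has to be stored at index time.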
