Hey folks,

I wanted to open up a discussion about a change to the usage of
SynonymQuery. The goal here is to have a broader library of queries that
can address other cases where related terms occupy the same position but
don't have the same meaning (such as hypernyms, hyponyms, meronyms,
ambiguous terms, and other query expansion situations).


I bring this up because we've noticed (as I'm sure many of you have) the
pattern of clients jamming any related term into a synonyms file and being
surprised by odd results. I like the idea of enforcing that "synonyms" means
exactly-the-same in Lucene-land. It's an easy rule to explain to a client,
and it makes simple patterns easy to set up. So for synonyms, I think
leaving SynonymQuery in place works great.

But I feel that if that's the rule, we need to open up discussion of other
methods of scoring the conceptual 'related term' relationships that usually
come up in the context of query expansion. This paper (
https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2, surveys
the current thinking on scoring various query expansion scenarios like
those we deal with in the messy, ambiguous uses of synonyms in prod systems
(khakis aren't trousers, they're a kind of trouser).


The cool thing is that many of the ideas in this paper seem doable with
existing Lucene index stats. So one might imagine a 'related terms' token
filter that injects some scoring based on how related each expansion term
really is to the original query term, using Jaccard, Dice, or other methods
called out in this paper.
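
To make the Jaccard/Dice idea concrete, here is a rough sketch in plain
Python (not Lucene code). It assumes we can get each term's document
frequency plus the number of documents containing both terms. The toy
postings map and the function names are made up for illustration; in a real
index the co-occurrence count would have to come from intersecting postings.

```python
# Toy postings: term -> set of doc ids. Hypothetical data, illustration only.
postings = {
    "trousers": {1, 2, 3, 4, 5, 6},
    "pants": {1, 2, 3, 4, 5, 8},
    "khakis": {2, 3, 7},
}


def cooccurrence_stats(postings, a, b):
    """Return (df_a, df_b, df_both) for two terms."""
    docs_a = postings.get(a, set())
    docs_b = postings.get(b, set())
    return len(docs_a), len(docs_b), len(docs_a & docs_b)


def jaccard(df_a, df_b, df_both):
    """Docs containing both terms / docs containing either term."""
    union = df_a + df_b - df_both
    return df_both / union if union else 0.0


def dice(df_a, df_b, df_both):
    """Twice the docs containing both terms / sum of the two doc freqs."""
    total = df_a + df_b
    return 2 * df_both / total if total else 0.0


if __name__ == "__main__":
    for related in ("pants", "khakis"):
        stats = cooccurrence_stats(postings, "trousers", related)
        print(related, round(jaccard(*stats), 3), round(dice(*stats), 3))
```

With this toy data "pants" comes out much closer to "trousers" than
"khakis" does, which is the kind of signal a related-terms filter could turn
into a per-term weight.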


Another insightful piece of research is this article on concept scoring (
https://usabilityetc.com/articles/information-retrieval-concept-matching/),
which prioritizes related terms by connectedness and other factors.

Needless to say, it's an open question how terms that someone has asserted
are related to a query term 'should be' scored. It's one of those things
that will likely always depend on a number of domain- and
application-specific factors. It's possibly a big opportunity for
improvement in Lucene - but it's likely about putting the right framework
in place to allow for a good default set of query-expansion scoring
scenarios with options for customization.

What I'm proposing is:


   - Submit a small patch that restricts SynonymQuery to tokens of type
     "SYNONYM" in the same position, which allows some short-term work to
     be done with the current Lucene QueryBuilder. Any additional
     non-synonym terms would be appended as a boolean query for now.

   - Begin work on alternate 'related-term' scoring systems that also key
     off the token type in QueryBuilder to create custom scoring using
     built-in term stats. The possibilities here are endless, up to weighted
     related terms (i.e. Alessandro's patch), feeding back Rocchio relevance
     feedback, etc.
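
To illustrate the first bullet, here is a rough sketch in plain Python (not
the actual QueryBuilder code) of how tokens at one position might be split by
type. The Token class, plan_clauses, and the "HYPONYM" type are hypothetical
names for illustration. I'm also assuming the first token at a position is
the original term and that only tokens typed "SYNONYM" join it in the
synonym group.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for a token with position and type attributes."""
    term: str
    position: int
    type: str = "word"


def plan_clauses(tokens):
    """Group tokens by position and split them by token type.

    Per position, the original term plus any "SYNONYM"-typed tokens go into
    one synonym group; every other type becomes an optional boolean clause.
    """
    plan = []
    # sorted() is stable, so the original term stays first within a position.
    ordered = sorted(tokens, key=lambda t: t.position)
    for position, group in groupby(ordered, key=lambda t: t.position):
        original, *rest = list(group)
        synonyms = [original.term] + [t.term for t in rest if t.type == "SYNONYM"]
        related = [t.term for t in rest if t.type != "SYNONYM"]
        plan.append(
            {"position": position, "synonym_query": synonyms, "should": related}
        )
    return plan


if __name__ == "__main__":
    tokens = [
        Token("trousers", 0),
        Token("pants", 0, "SYNONYM"),
        Token("khakis", 0, "HYPONYM"),  # hypothetical non-synonym type
        Token("blue", 1),
    ]
    for entry in plan_clauses(tokens):
        print(entry)
```

The second bullet would then replace the plain "should" list with whatever
type-specific scoring we settle on.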


I'm curious what folks would think of a patch for bullet one followed by
other patches down the road for additional functionality?

(related to discussion in this Elasticsearch PR:
https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249)

-- 
CTO, OpenSource Connections
Author, Relevant Search
http://o19s.com/doug
