Re: SynonymQuery / Query Expansion Strategies Discussion

jim ferenczi Thu, 22 Nov 2018 08:45:35 -0800

My proposal was to tweak the boosting directly in the token filters through
a single Attribute, but if we feel that this is too much to add to the
analysis chain, I agree that we don't need to add any API. If you rely on
abstract attributes (type, ...), then it should be easy to subclass the
query builder to access them and implement the logic you want there.
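
To make the subclassing idea concrete, here is a minimal, Lucene-free sketch of the logic such a subclass might apply: group stacked tokens by position and derive a boost from each token's type. Everything in it is an assumption for illustration: the classes AnalyzedToken, BoostedTerm and TypeAwareExpansion are hypothetical stand-ins rather than Lucene API, the "HYPONYM" and "HYPERNYM" type names are invented, and the boost values are arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for one analyzed token (not a Lucene class). */
final class AnalyzedToken {
    final String text;
    final String type;   // token type, as a type attribute would report it
    final int posInc;    // position increment; 0 means stacked on the previous token

    AnalyzedToken(String text, String type, int posInc) {
        this.text = text;
        this.type = type;
        this.posInc = posInc;
    }
}

/** Hypothetical result: a term plus the boost chosen for it. */
final class BoostedTerm {
    final String text;
    final float boost;

    BoostedTerm(String text, float boost) {
        this.text = text;
        this.boost = boost;
    }
}

final class TypeAwareExpansion {
    /** Assumed policy: the boost values and the non-SYNONYM type names are illustrative only. */
    static float boostFor(String type) {
        switch (type) {
            case "HYPONYM":  return 0.8f;
            case "HYPERNYM": return 0.5f;
            default:         return 1.0f; // original terms and true synonyms
        }
    }

    /** Groups tokens that share a position and attaches a type-derived boost to each. */
    static List<List<BoostedTerm>> groupByPosition(List<AnalyzedToken> tokens) {
        List<List<BoostedTerm>> positions = new ArrayList<>();
        for (AnalyzedToken token : tokens) {
            // A positive increment (or the very first token) opens a new position.
            if (token.posInc > 0 || positions.isEmpty()) {
                positions.add(new ArrayList<>());
            }
            positions.get(positions.size() - 1)
                     .add(new BoostedTerm(token.text, boostFor(token.type)));
        }
        return positions;
    }
}
```

In a real QueryBuilder subclass, each inner list would become whatever query the subclass chooses for that position (a synonym-style query, a boolean query of boosted terms, and so on); that choice is left open here.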


On Thu, Nov 22, 2018 at 13:18, Robert Muir <rcm...@gmail.com> wrote:

> There is already analyzeBoolean/analyzeMultiBoolean there that you can
> use for this. You can look at any attribute on the tokenstream you
> want. I don't see any need to add any more API.
>
> On 11/21/18, Doug Turnbull <dturnb...@opensourceconnections.com> wrote:
> > I agree there is a tension between analysis and query parser
> > responsibilities (or external to how queries are constructed). I wonder
> > what you'd think of making QueryBuilder more easily subclassable by
> > passing more term metadata to newSynonymQuery (such as types etc). This
> > would let you select an alt strategy (such as some of the scoring
> > systems used in the query expansion paper
> > https://arxiv.org/pdf/1708.00247.pdf). Or doing something with a term
> > labeled a hyponym/hypernym in a QueryBuilder subclass.
> >
> > -Doug
> >
> > On Wed, Nov 21, 2018 at 8:09 AM Robert Muir <rcm...@gmail.com> wrote:
> >
> >> I don't think we should put scoring stuff into the analysis chain like
> >> this. It already has a laundry list of responsibilities.
> >>
> >> Analysis chain can tell you the term is stacked or it's a certain type
> >> or occurs a certain number of times, but it shouldn't be supplying
> >> things such as floating point boosts. That kind of scoring
> >> manipulation needs to really happen in query parsing/somewhere else.
> >>
> >> On 11/20/18, jim ferenczi <jim.feren...@gmail.com> wrote:
> >> > Sorry for the late reply,
> >> >
> >> >> So perhaps one way forward to contribute this sort of thing into
> >> >> Lucene is we could implement additional QueryBuilder implementations
> >> >> that provide such functionality?
> >> >
> >> > I am not sure. I mentioned Solr and ES because I thought it was about
> >> > adding taxonomies and complex expansion mechanisms to query builders,
> >> > but I wonder if we can have a simple mechanism to just (de)boost
> >> > stacked tokens in the QueryBuilder. It could be a new attribute that
> >> > token filters would use when they produce stacked tokens and that the
> >> > QueryBuilder checks when it builds the SynonymQuery. We already have a
> >> > TermFrequencyAttribute to alter the frequency of a term when indexing,
> >> > so we could have the same mechanism for query term boosting?
> >> >
> >> > On Sun, Nov 18, 2018 at 02:24, Doug Turnbull <
> >> > dturnb...@opensourceconnections.com> wrote:
> >> >
> >> >> Thanks Jim
> >> >>
> >> >> Yeah, now that I think about it - I agree that perhaps the simplest
> >> >> option would be to create alternate query builders. I think there are
> >> >> a couple of enhancements to the base class that would be nice, such as
> >> >> - Some additional token attributes passed to newSynonymQuery, such as
> >> >> the type (was this a synonym or hyponym or something else...)
> >> >> - The ability to differentiate between the original query term and
> >> >> the generated synonym terms
> >> >> - Consistent support for phrases
> >> >>
> >> >> I think part of my goal too is to help people without the use of
> >> >> plugins, as we are often in scenarios at OpenSource Connections where
> >> >> people won't be able to use a plugin. In this case alternate expansions
> >> >> around hypernyms/hyponyms/?... are a pretty frequent gap that search
> >> >> teams have using Solr/Lucene/ES.
> >> >>
> >> >> So perhaps one way forward to contribute this sort of thing into
> >> >> Lucene is we could implement additional QueryBuilder implementations
> >> >> that provide such functionality?
> >> >>
> >> >> Thanks
> >> >> -Doug
> >> >>
> >> >> On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi <jim.feren...@gmail.com>
> >> >> wrote:
> >> >>
> >> >>> You can easily customize the query that is used for synonyms in a
> >> >>> custom QueryBuilder. The javadoc of *newSynonymQuery* says "This is
> >> >>> intended for subclasses that wish to customize the generated
> >> >>> queries.", so I don't think we need to do anything there. I agree
> >> >>> that it is sometimes better to use something different than the
> >> >>> SynonymQuery, but in the general case it works as expected and can be
> >> >>> combined with other terms naturally. The kind of customization you
> >> >>> want to achieve could be done in a plugin (or in Solr or ES) that
> >> >>> extends the QueryBuilder; you can also use custom token filters and
> >> >>> alter the query the way you want. My point here is that the
> >> >>> QueryBuilder should remain simple; you can add the complexity you
> >> >>> want in a subclass.
> >> >>> However, I think there is another area we need to fix: the scoring of
> >> >>> multi-term synonyms is broken (compared to the SynonymQuery) and
> >> >>> could be improved, so we need something similar to the SynonymQuery
> >> >>> that handles multi phrases.
> >> >>>
> >> >>>
> >> >>> On Sat, Nov 17, 2018 at 07:19, Doug Turnbull <
> >> >>> dturnb...@opensourceconnections.com> wrote:
> >> >>>
> >> >>>> Yes that is another good area (there are many). Although of course
> >> >>>> embeddings have their own challenges and complexities (they often
> >> >>>> capture shared context, but not shared meaning).
> >> >>>>
> >> >>>> It's a data point though of something we'd want to include in such
> >> >>>> a framework, though not sure where it would go on the roadmap...
> >> >>>>
> >> >>>> On Sat, Nov 17, 2018 at 1:15 AM J. Delgado
> >> >>>> <joaquin.delg...@gmail.com> wrote:
> >> >>>>
> >> >>>>> What about the use of word embeddings (see
> >> >>>>> https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
> >> >>>>> to compute word similarity?
> >> >>>>>
> >> >>>>> On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <
> >> >>>>> dturnb...@opensourceconnections.com> wrote:
> >> >>>>>
> >> >>>>>> Hey folks,
> >> >>>>>>
> >> >>>>>> I wanted to open up a discussion about a change to the usage of
> >> >>>>>> SynonymQuery. The goal here is to have a broader library of
> >> >>>>>> queries that can address other cases where related terms occupy
> >> >>>>>> the same position but don't have the same meaning (such as
> >> >>>>>> hypernyms, hyponyms, meronyms, ambiguous terms, and other query
> >> >>>>>> expansion situations).
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> I bring this up because we've noticed (as I'm sure many of you
> >> >>>>>> have) the pattern of clients jamming any related term into a
> >> >>>>>> synonyms file and being surprised with odd results. I like the
> >> >>>>>> idea of enforcing that "synonyms" means exactly-the-same in
> >> >>>>>> Lucene-land. It's an easy thing to tell a client and set up
> >> >>>>>> simple patterns. So for synonyms, I think leaving SynonymQuery in
> >> >>>>>> place works great.
> >> >>>>>>
> >> >>>>>> But I feel if that's the rule, we need to open up discussion of
> >> >>>>>> other methods of scoring conceptual 'related term' relationships
> >> >>>>>> that usually come up in the context of query expansion. This
> >> >>>>>> paper (https://arxiv.org/pdf/1708.00247.pdf), particularly
> >> >>>>>> section 3.2, surveys the current thinking for scoring various
> >> >>>>>> query expansion scenarios like those we deal with in the messy,
> >> >>>>>> ambiguous uses of synonyms in prod systems (khakis aren't
> >> >>>>>> trousers, they're a kind-of trouser).
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> The cool thing is many of the ideas in this paper seem doable
> >> >>>>>> with existing Lucene index stats. So one might imagine a 'related
> >> >>>>>> terms' token filter that injected some scoring based on how
> >> >>>>>> related it really is to the original query term using Jaccard,
> >> >>>>>> Dice, or other methods called out in this paper.
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> Another insightful set of research is this article on concept
> >> >>>>>> scoring
> >> >>>>>> (https://usabilityetc.com/articles/information-retrieval-concept-matching/),
> >> >>>>>> which prioritizes related terms by connectedness and other
> >> >>>>>> factors.
> >> >>>>>>
> >> >>>>>> Needless to say, it's an open area how two terms someone has
> >> >>>>>> asserted are related to a query term 'should be' scored. It's one
> >> >>>>>> of those things that likely will forever depend on a number of
> >> >>>>>> domain- and application-specific factors. It's possibly a big
> >> >>>>>> opportunity of improvement for Lucene - but likely is about
> >> >>>>>> putting the right framework in place to allow for a good default
> >> >>>>>> set of query-expansion scoring scenarios with options for
> >> >>>>>> customization.
> >> >>>>>>
> >> >>>>>> What I'm proposing is:
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>    - Submit a small patch that restricts SynonymQuery to tokens of
> >> >>>>>>      type "SYNONYM" in the same position, which allows some
> >> >>>>>>      short-term work to be done with the current Lucene
> >> >>>>>>      QueryBuilder. Any additional non-synonym terms would be
> >> >>>>>>      appended as a boolean query for now.
> >> >>>>>>    - Begin work on alternate 'related-term' scoring systems that
> >> >>>>>>      also key off the token type in QueryBuilder to create custom
> >> >>>>>>      scoring using built-in term stats. The possibilities here are
> >> >>>>>>      endless, up to weighted related terms (i.e. Alessandro's
> >> >>>>>>      patch), feeding back Rocchio relevance feedback, etc.
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> I'm curious what folks would think of a patch for bullet one
> >> >>>>>> followed by other patches down the road for additional
> >> >>>>>> functionality?
> >> >>>>>>
> >> >>>>>> (related to discussion in this Elasticsearch PR
> >> >>>>>> https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249)
> >> >>>>>>
> >> >>>>>> --
> >> >>>>>> CTO, OpenSource Connections
> >> >>>>>> Author, Relevant Search
> >> >>>>>> http://o19s.com/doug
> >> >>>>>>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>
>
>
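
The original post in this thread mentions scoring related terms with Jaccard or Dice using existing index stats. As a rough sketch of those two coefficients over document sets: it assumes you already have the document frequency of each term and the number of documents containing both. How those counts are obtained from an index is not shown, and the class and parameter names are hypothetical.

```java
/** Hypothetical helper; not part of Lucene. */
final class TermRelatedness {
    /**
     * Jaccard coefficient |A ∩ B| / |A ∪ B|, where A and B are the sets of
     * documents containing each term and |A ∪ B| = dfA + dfB - both.
     */
    static double jaccard(long dfA, long dfB, long both) {
        long union = dfA + dfB - both;
        return union == 0 ? 0.0 : (double) both / union;
    }

    /** Dice coefficient 2 * |A ∩ B| / (|A| + |B|). */
    static double dice(long dfA, long dfB, long both) {
        long total = dfA + dfB;
        return total == 0 ? 0.0 : 2.0 * both / total;
    }
}
```

Either value lies in [0, 1] and could, for example, serve as the boost of a related term relative to the original query term; whether that is a good scoring choice depends on the application.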
