Re: SynonymQuery / Query Expansion Strategies Discussion

Michael Gibney Wed, 21 Nov 2018 07:50:55 -0800

On the analysis chain side, could the desired functionality be scoped to:
providing a framework (Attribute?) to express information about the
relationship between a derived token and its corresponding input? For
example, one might include information about:
1. corresponding input token (i.e., input token text?)
2. relationship between derived token and input (e.g., synonym, hyponym,
hypernym ... but perhaps not limited to these)
3. degree of confidence/weight in the derived token? This would represent a
concept distinct from "weight" for the purpose of scoring, and could thus
be appropriate to the analysis chain.
4. source/reason of token derivation relationship (e.g., specific ontology,
taxonomy, etc...)
5. ....


This could provide all the information necessary to support custom indexing
strategies and/or query strategies, while remaining strictly focused on
analysis per se. This type of approach (if relationship info were recorded
in the index, e.g. via Payload) could also support explicitly navigable
facets that are ontology-aware, or other potentially interesting things ...
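
As a rough, language-neutral illustration of the kind of record items 1-4
describe, here is a minimal sketch in Python. It is not an existing Lucene
Attribute. The names TokenDerivation, Relationship and
group_by_relationship, the [0, 1] confidence range, and the
"example-taxonomy" source label are all assumptions made up for
illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List, Optional


class Relationship(Enum):
    """How a derived token relates to its input token (item 2)."""
    SYNONYM = "synonym"
    HYPONYM = "hyponym"
    HYPERNYM = "hypernym"
    OTHER = "other"  # the list is deliberately open-ended


@dataclass(frozen=True)
class TokenDerivation:
    """Hypothetical per-token metadata; NOT a real Lucene class."""
    derived_text: str                # the token emitted by the filter
    input_text: str                  # item 1: corresponding input token
    relationship: Relationship       # item 2: kind of relationship
    confidence: float = 1.0          # item 3: assumed to lie in [0, 1]
    source: Optional[str] = None     # item 4: e.g. an ontology name

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")


def group_by_relationship(
    derivations: Iterable[TokenDerivation],
) -> Dict[Relationship, List[str]]:
    """Bucket derived tokens by relationship type, preserving input order,
    so a consumer (indexing or query side) can treat each kind differently."""
    groups: Dict[Relationship, List[str]] = {}
    for d in derivations:
        groups.setdefault(d.relationship, []).append(d.derived_text)
    return groups


# Illustrative data only: "trousers" derived from "khakis" as a hypernym.
stacked = [
    TokenDerivation("trousers", "khakis", Relationship.HYPERNYM, 0.8,
                    "example-taxonomy"),
    TokenDerivation("pants", "trousers", Relationship.SYNONYM),
]
groups = group_by_relationship(stacked)
```

Note that the record carries a confidence value but says nothing about
scoring. Whether and how a consumer turns it into a boost stays outside
the analysis chain.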

Michael


On Wed, Nov 21, 2018 at 9:24 AM Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> I agree there is a tension between analysis and query parser
> responsibilities (or external to how queries are constructed). I wonder
> what you'd think of making QueryBuilder easier to subclass by passing
> more term metadata to newSynonymQuery (such as types, etc.). This would let
> you select an alternate strategy (such as some of the scoring systems used
> in the query expansion paper https://arxiv.org/pdf/1708.00247.pdf), or do
> something with a term labeled a hyponym/hypernym in a QueryBuilder
> subclass.
>
> -Doug
>
> On Wed, Nov 21, 2018 at 8:09 AM Robert Muir <rcm...@gmail.com> wrote:
>
>> I don't think we should put scoring stuff into the analysis chain like
>> this. It already has a laundry list of responsibilities.
>>
>> Analysis chain can tell you the term is stacked or it's a certain type
>> or occurs a certain number of times, but it shouldn't be supplying
>> things such as floating point boosts. That kind of scoring
>> manipulation really needs to happen in query parsing/somewhere else.
>>
>> On 11/20/18, jim ferenczi <jim.feren...@gmail.com> wrote:
>> > Sorry for the late reply,
>> >
>> >> So perhaps one way forward to contribute this sort of thing into Lucene
>> >> is we could implement additional QueryBuilder implementations that
>> >> provide such functionality?
>> >
>> > I am not sure. I mentioned Solr and ES because I thought it was about
>> > adding taxonomies and complex expansion mechanisms to query builders,
>> > but I wonder if we can have a simple mechanism to just (de)boost
>> > stacked tokens in the QueryBuilder. It could be a new attribute that
>> > token filters would use when they produce stacked tokens and that the
>> > QueryBuilder checks when it builds the SynonymQuery. We already have a
>> > TermFrequencyAttribute to alter the frequency of a term when indexing,
>> > so we could have the same mechanism for query term boosting?
>> >
>> > On Sun, Nov 18, 2018 at 02:24, Doug Turnbull <
>> > dturnb...@opensourceconnections.com> wrote:
>> >
>> >> Thanks Jim
>> >>
>> >> Yeah, now that I think about it - I agree that perhaps the simplest
>> >> option would be to create alternate query builders. I think there are
>> >> a couple of enhancements to the base class that would be nice, such as:
>> >> - Some additional token attributes passed to newSynonymQuery, such as
>> >>   the type (was this a synonym or hyponym or something else...)
>> >> - The ability to differentiate between the original query term and the
>> >>   generated synonym terms
>> >> - Consistent support for phrases
>> >>
>> >> I think part of my goal too is to help people without the use of
>> >> plugins, as at OpenSource Connections we are often in scenarios where
>> >> people won't be able to use a plugin. In this case alternate expansions
>> >> around hypernyms/hyponyms/?... are a pretty frequent gap that search
>> >> teams using Solr/Lucene/ES have.
>> >>
>> >> So perhaps one way forward to contribute this sort of thing into Lucene
>> >> is we could implement additional QueryBuilder implementations that
>> >> provide such functionality?
>> >>
>> >> Thanks
>> >> -Doug
>> >>
>> >> On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi <jim.feren...@gmail.com>
>> >> wrote:
>> >>
>> >>> You can easily customize the query that is used for synonyms in a
>> >>> custom QueryBuilder. The javadoc of *newSynonymQuery* says "This is
>> >>> intended for subclasses that wish to customize the generated
>> >>> queries." so I don't think we need to do anything there. I agree that
>> >>> it is sometimes better to use something different than the
>> >>> SynonymQuery, but in the general case it works as expected and can be
>> >>> combined with other terms naturally. The kind of customization you
>> >>> want to achieve could be done in a plugin (or in Solr or ES) that
>> >>> extends the QueryBuilder; you can also use custom token filters and
>> >>> alter the query the way you want. My point here is that the
>> >>> QueryBuilder should remain simple; you can add the complexity you
>> >>> want in a subclass.
>> >>> However, I think there is another area we need to fix: the scoring of
>> >>> multi-term synonyms is broken (compared to the SynonymQuery) and
>> >>> could be improved, so we need something similar to the SynonymQuery
>> >>> that handles multi-term phrases.
>> >>>
>> >>>
>> >>> On Sat, Nov 17, 2018 at 07:19, Doug Turnbull <
>> >>> dturnb...@opensourceconnections.com> wrote:
>> >>>
>> >>>> Yes, that is another good area (there are many). Although of course
>> >>>> embeddings have their own challenges and complexities (they often
>> >>>> capture shared context, but not shared meaning).
>> >>>>
>> >>>> It's a data point though of something we'd want to include in such a
>> >>>> framework, though not sure where it would go on the roadmap...
>> >>>>
>> >>>> On Sat, Nov 17, 2018 at 1:15 AM J. Delgado <joaquin.delg...@gmail.com>
>> >>>> wrote:
>> >>>>
>> >>>>> What about the use of word embeddings (see
>> >>>>> https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa
>> >>>>> ) to compute word similarity?
>> >>>>>
>> >>>>> On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <
>> >>>>> dturnb...@opensourceconnections.com> wrote:
>> >>>>>
>> >>>>>> Hey folks,
>> >>>>>>
>> >>>>>> I wanted to open up a discussion about a change to the usage of
>> >>>>>> SynonymQuery. The goal here is to have a broader library of queries
>> >>>>>> that can address other cases where related terms occupy the same
>> >>>>>> position but don't have the same meaning (such as hypernyms,
>> >>>>>> hyponyms, meronyms, ambiguous terms, and other query expansion
>> >>>>>> situations).
>> >>>>>>
>> >>>>>>
>> >>>>>> I bring this up because we've noticed (as I'm sure many of you have)
>> >>>>>> the pattern of clients jamming any related term into a synonyms file
>> >>>>>> and being surprised by odd results. I like the idea of enforcing that
>> >>>>>> "synonyms" means exactly-the-same in Lucene-land. It's an easy thing
>> >>>>>> to tell a client and set up simple patterns. So for synonyms, I
>> >>>>>> think leaving SynonymQuery in place works great.
>> >>>>>>
>> >>>>>> But I feel if that's the rule, we need to open up discussion of other
>> >>>>>> methods of scoring conceptual 'related term' relationships that
>> >>>>>> usually come up in the context of query expansion. This paper
>> >>>>>> (https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2,
>> >>>>>> surveys the current thinking for scoring various query expansion
>> >>>>>> scenarios like those we deal with in the messy, ambiguous uses of
>> >>>>>> synonyms in prod systems (khakis aren't trousers, they're a kind-of
>> >>>>>> trouser).
>> >>>>>>
>> >>>>>>
>> >>>>>> The cool thing is many of the ideas in this paper seem doable with
>> >>>>>> existing Lucene index stats. So one might imagine a 'related terms'
>> >>>>>> token filter that injected some scoring based on how related each
>> >>>>>> term really is to the original query term using Jaccard, Dice, or
>> >>>>>> other methods called out in this paper.
>> >>>>>>
>> >>>>>>
>> >>>>>> Another insightful piece of research is this article on concept
>> >>>>>> scoring
>> >>>>>> (https://usabilityetc.com/articles/information-retrieval-concept-matching/),
>> >>>>>> which prioritizes related terms by connectedness and other factors.
>> >>>>>>
>> >>>>>> Needless to say, it's an open area how two terms someone has
>> >>>>>> asserted are related to a query term 'should be' scored. It's one of
>> >>>>>> those things that likely will forever depend on a number of
>> >>>>>> domain- and application-specific factors. It's possibly a big
>> >>>>>> opportunity for improvement in Lucene, but it is likely about
>> >>>>>> putting the right framework in place to allow for a good default
>> >>>>>> set of query-expansion scoring scenarios with options for
>> >>>>>> customization.
>> >>>>>>
>> >>>>>> What I'm proposing is:
>> >>>>>>
>> >>>>>>    - Submit a small patch that restricts SynonymQuery to tokens of
>> >>>>>>      type "SYNONYM" in the same position, which allows some
>> >>>>>>      short-term work to be done with the current Lucene
>> >>>>>>      QueryBuilder. Any additional non-synonym terms would be
>> >>>>>>      appended as a boolean query for now.
>> >>>>>>
>> >>>>>>    - Begin work on alternate 'related-term' scoring systems that
>> >>>>>>      also key off the token type in QueryBuilder to create custom
>> >>>>>>      scoring using built-in term stats. The possibilities here are
>> >>>>>>      endless, up to weighted related terms (i.e., Alessandro's
>> >>>>>>      patch), feeding back Rocchio relevance feedback, etc.
>> >>>>>>
>> >>>>>>
>> >>>>>> I'm curious what folks would think of a patch for bullet one,
>> >>>>>> followed by other patches down the road for additional functionality?
>> >>>>>>
>> >>>>>> (related to discussion in this Elasticsearch PR:
>> >>>>>> https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249
>> >>>>>> )
>> >>>>>>
>> >>>>>> --
>> >>>>>> CTO, OpenSource Connections
>> >>>>>> Author, Relevant Search
>> >>>>>> http://o19s.com/doug
>> >>>>>>
