Re: SynonymQuery / Query Expansion Strategies Discussion

jim ferenczi Tue, 20 Nov 2018 11:19:47 -0800

Sorry for the late reply,

> So perhaps one way forward to contribute this sort of thing into Lucene
> is we could implement additional QueryBuilder implementations that provide
> such functionality?


I am not sure. I mentioned Solr and ES because I thought this was about
adding taxonomies and complex expansion mechanisms to query builders, but I
wonder if we could have a simple mechanism to just (de)boost stacked tokens
in the QueryBuilder. It could be a new attribute that token filters set when
they produce stacked tokens and that the QueryBuilder checks when it builds
the SynonymQuery. We already have a TermFrequencyAttribute to alter the
frequency of a term when indexing, so we could use the same mechanism for
query term boosting?
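
As a rough illustration of the idea above, here is a minimal, self-contained
sketch in plain Java (no Lucene dependency). StackedToken, its boost field,
and BoostedSynonymSketch are hypothetical names used only for illustration;
no such per-token boost attribute is assumed to exist in Lucene. It only shows
how a builder could collect a per-term boost for tokens stacked at the same
position.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a token emitted by an analysis chain.
// positionIncrement == 0 means the token is stacked on the previous one.
// "boost" models the proposed per-token boost attribute (an assumption).
final class StackedToken {
    final String term;
    final int positionIncrement;
    final float boost;

    StackedToken(String term, int positionIncrement, float boost) {
        this.term = term;
        this.positionIncrement = positionIncrement;
        this.boost = boost;
    }
}

final class BoostedSynonymSketch {
    // Groups tokens by position. Each position maps term -> boost, which is
    // the information a query builder would need to weight every stacked
    // term individually when it assembles the synonym clause.
    static List<Map<String, Float>> groupByPosition(List<StackedToken> tokens) {
        List<Map<String, Float>> positions = new ArrayList<>();
        for (StackedToken t : tokens) {
            // Start a new position unless the token is stacked (increment 0).
            if (positions.isEmpty() || t.positionIncrement > 0) {
                positions.add(new LinkedHashMap<>());
            }
            positions.get(positions.size() - 1).put(t.term, t.boost);
        }
        return positions;
    }
}
```

With tokens "khakis" (increment 1, boost 1.0) and "trousers" (increment 0,
boost 0.5), both terms end up in the same position group, each with its own
weight.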

On Sun, Nov 18, 2018 at 02:24, Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> Thanks Jim
>
> Yeah, now that I think about it - I agree that perhaps the simplest option
> would be to create alternate query builders. I think there are a couple of
> enhancements to the base class that would be nice, such as
> - Some additional token attributes passed to newSynonymQuery, such as the
> type (was this a synonym or hyponym or something else...)
> - The ability to differentiate between the original query term and the
> generated synonym terms
> - Consistent support for phrases
>
> I think part of my goal too is to help people without the use of plugins,
> as we are often in scenarios at OpenSource Connections where people won't
> be able to use a plugin. In this case alternate expansions around
> hypernyms/hyponyms/?... are a pretty frequent gap that search teams have
> when using Solr/Lucene/ES.
>
> So perhaps one way forward to contribute this sort of thing into Lucene is
> we could implement additional QueryBuilder implementations that provide
> such functionality?
>
> Thanks
> -Doug
>
> On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi <jim.feren...@gmail.com>
> wrote:
>
>> You can easily customize the query that is used for synonyms in a custom
>> QueryBuilder. The javadoc of *newSynonymQuery* says "This is
>> intended for subclasses that wish to customize the generated queries." so I
>> don't think we need to do anything there. I agree that it is sometimes
>> better to use something other than the SynonymQuery, but in the general
>> case it works as expected and can be combined with other terms naturally.
>> The kind of customization you want to achieve could be done in a plugin (or
>> in Solr or ES) that extends the QueryBuilder. You can also use custom token
>> filters and alter the query the way you want. My point here is that the
>> QueryBuilder should remain simple; you can add the complexity you want in a
>> subclass.
>> However, I think there is another area we need to fix: the scoring of
>> multi-term synonyms is broken (compared to the SynonymQuery) and could be
>> improved, so we need something similar to the SynonymQuery that handles
>> multiple phrases.
>>
>>
>> On Sat, Nov 17, 2018 at 07:19, Doug Turnbull <
>> dturnb...@opensourceconnections.com> wrote:
>>
>>> Yes, that is another good area (there are many), although of course
>>> embeddings have their own challenges and complexities (they often capture
>>> shared context, but not shared meaning).
>>>
>>> It's a data point of something we'd want to include in such a
>>> framework, though I'm not sure where it would go on the roadmap...
>>>
>>> On Sat, Nov 17, 2018 at 1:15 AM J. Delgado <joaquin.delg...@gmail.com>
>>> wrote:
>>>
>>>> What about the use of word embeddings (see
>>>>
>>>> https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
>>>> to compute word similarity?
>>>>
>>>> On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <
>>>> dturnb...@opensourceconnections.com> wrote:
>>>>
>>>>> Hey folks,
>>>>>
>>>>> I wanted to open up a discussion about a change to the usage of
>>>>> SynonymQuery. The goal here is to have a broader library of queries that
>>>>> can address other cases where related terms occupy the same position but
>>>>> don't have the same meaning (such as hypernyms, hyponyms, meronyms,
>>>>> ambiguous terms, and other query expansion situations).
>>>>>
>>>>>
>>>>> I bring this up because we've noticed (as I'm sure many of you have)
>>>>> the pattern of clients jamming any related term into a synonyms file and
>>>>> being surprised with odd results. I like the idea of enforcing that
>>>>> "synonyms" means exactly-the-same in Lucene-land. It's an easy thing to
>>>>> tell a client and to set up simple patterns around. So for synonyms, I
>>>>> think leaving SynonymQuery in place works great.
>>>>>
>>>>> But I feel if that's the rule, we need to open up discussion of other
>>>>> methods of scoring conceptual 'related term' relationships that usually
>>>>> come up in the context of query expansion. This paper (
>>>>> https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2,
>>>>> surveys the current thinking for scoring various query expansion scenarios
>>>>> like those we deal with in the messy, ambiguous uses of synonyms in prod
>>>>> systems (khakis aren't trousers, they're a kind-of trouser).
>>>>>
>>>>>
>>>>> The cool thing is many of the ideas in this paper seem doable with
>>>>> existing Lucene index stats. So one might imagine a 'related terms' token
>>>>> filter that injected some scoring based on how related it really is
>>>>> to the original query term using Jaccard, Dice, or other methods called
>>>>> out in this paper.
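
For reference, the Jaccard and Dice coefficients mentioned above can be
computed from three document counts. The following is a minimal plain-Java
sketch. It assumes the counts (documents containing term A, term B, and both)
have already been read from index statistics; TermRelatedness and its method
names are hypothetical and not part of Lucene.

```java
// Hypothetical helper: relatedness of two terms from document counts.
final class TermRelatedness {
    // Jaccard = |A and B| / |A or B|, where |A or B| = dfA + dfB - dfBoth.
    static double jaccard(long dfA, long dfB, long dfBoth) {
        long union = dfA + dfB - dfBoth;
        return union == 0 ? 0.0 : (double) dfBoth / union;
    }

    // Dice = 2 * |A and B| / (|A| + |B|).
    static double dice(long dfA, long dfB, long dfBoth) {
        long total = dfA + dfB;
        return total == 0 ? 0.0 : 2.0 * dfBoth / total;
    }
}
```

For example, with 100 documents containing one term, 50 containing the other,
and 25 containing both, Jaccard is 25 / 125 = 0.2 and Dice is 50 / 150, about
0.33. Either value could serve as a weight for the related term.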
>>>>>
>>>>>
>>>>> Another insightful set of research is this article on concept scoring (
>>>>> https://usabilityetc.com/articles/information-retrieval-concept-matching/
>>>>> ), which prioritizes related terms by connectedness and other factors.
>>>>>
>>>>> Needless to say, it's an open area how two terms someone has asserted
>>>>> are related to a query term 'should be' scored. It's one of those things
>>>>> that likely will forever depend on a number of domain- and
>>>>> application-specific factors. It's possibly a big opportunity for
>>>>> improvement in Lucene, but it is likely about putting the right framework
>>>>> in place to allow for a good default set of query-expansion scoring
>>>>> scenarios with options for customization.
>>>>>
>>>>> What I'm proposing is:
>>>>>
>>>>>
>>>>>    - Submit a small patch that restricts SynonymQuery to tokens of type
>>>>>      "SYNONYM" in the same position, which allows some short-term work to
>>>>>      be done with the current Lucene QueryBuilder. Any additional
>>>>>      non-synonym terms would be appended as a boolean query for now.
>>>>>    - Begin work on alternate 'related-term' scoring systems that also
>>>>>      key off the token type in QueryBuilder to create custom scoring using
>>>>>      built-in term stats. The possibilities here are endless, up to
>>>>>      weighted related terms (i.e. Alessandro's patch), feeding back Rocchio
>>>>>      relevance feedback, etc.
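
The first bullet above can be pictured with a small plain-Java sketch (no
Lucene dependency). TypedToken, PositionClauses, and SynonymTypeSplitter are
hypothetical names, and the "HYPONYM" type in the usage note is made up for
illustration. The sketch also assumes the first token at a position is the
original query term, which the proposal does not spell out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical token model: a term plus the type set by the analysis chain.
final class TypedToken {
    final String term;
    final String type;

    TypedToken(String term, String type) {
        this.term = term;
        this.type = type;
    }
}

// Result of splitting the tokens found at a single position.
final class PositionClauses {
    // Terms that would be scored together as one synonym clause.
    final List<String> synonymTerms = new ArrayList<>();
    // Terms that would be appended as separate boolean clauses.
    final List<String> relatedTerms = new ArrayList<>();
}

final class SynonymTypeSplitter {
    static final String TYPE_SYNONYM = "SYNONYM";

    // Assumption: index 0 is the original term; it stays in the synonym group.
    static PositionClauses split(List<TypedToken> tokensAtPosition) {
        PositionClauses out = new PositionClauses();
        for (int i = 0; i < tokensAtPosition.size(); i++) {
            TypedToken t = tokensAtPosition.get(i);
            if (i == 0 || TYPE_SYNONYM.equals(t.type)) {
                out.synonymTerms.add(t.term);
            } else {
                out.relatedTerms.add(t.term);
            }
        }
        return out;
    }
}
```

Given "pants" (original), "trousers" (type SYNONYM), and "khakis" (type
HYPONYM) at one position, the first two land in the synonym group and
"khakis" is kept apart for a separate clause.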
>>>>>
>>>>>
>>>>> I'm curious what folks would think of a patch for bullet one followed
>>>>> by other patches down the road for additional functionality?
>>>>>
>>>>> (related to discussion in this Elasticsearch PR:
>>>>> https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249
>>>>> )
>>>>>
>>>>> --
>>>>> CTO, OpenSource Connections
>>>>> Author, Relevant Search
>>>>> http://o19s.com/doug
>>>>>
