Re: SynonymQuery / Query Expansion Strategies Discussion

David Smiley Tue, 20 Nov 2018 11:25:16 -0800

+1 great idea Jim!

On Tue, Nov 20, 2018 at 2:19 PM jim ferenczi <jim.feren...@gmail.com> wrote:


> Sorry for the late reply,
>
> > So perhaps one way forward to contribute this sort of thing into Lucene
> is we could implement additional QueryBuilder implementations that provide
> such functionality?
>
> I am not sure. I mentioned Solr and ES because I thought this was about
> adding taxonomies and complex expansion mechanisms to query builders, but I
> wonder if we can have a simple
> mechanism to just (de)boost stacked tokens in the QueryBuilder. It could
> be a new attribute that token filters set when they produce stacked
> tokens and that the QueryBuilder checks when it builds the SynonymQuery. We
> already have a TermFrequencyAttribute to alter the frequency of a term when
> indexing, so we could have the same mechanism for query term boosting?
>
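A minimal sketch of the (de)boosting idea above, written in plain Java without Lucene on the classpath. The `Token` class and its `boost` field are hypothetical stand-ins for a token stream carrying a per-token boost attribute; nothing here is an existing Lucene API. It only shows the grouping step: tokens with a position increment of 0 are stacked on the previous position, and each position ends up with a term-to-boost map that a query builder could turn into a weighted synonym query.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not a Lucene class.
final class StackedTokenBoosts {

    // Stand-in for an analyzed token; "boost" plays the role of the proposed attribute.
    static final class Token {
        final String term;
        final int positionIncrement; // 0 means "stacked on the previous token"
        final float boost;

        Token(String term, int positionIncrement, float boost) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.boost = boost;
        }
    }

    // Groups tokens by position; each map holds term -> boost for one position.
    static List<Map<String, Float>> weightedTermsByPosition(List<Token> tokens) {
        List<Map<String, Float>> positions = new ArrayList<>();
        for (Token t : tokens) {
            // Start a new position for the first token or any non-stacked token.
            if (positions.isEmpty() || t.positionIncrement > 0) {
                positions.add(new LinkedHashMap<>());
            }
            positions.get(positions.size() - 1).put(t.term, t.boost);
        }
        return positions;
    }

    public static void main(String[] args) {
        List<Token> tokens = List.of(
            new Token("khakis", 1, 1.0f),
            new Token("trousers", 0, 0.5f), // stacked and de-boosted
            new Token("sale", 1, 1.0f));
        // prints [{khakis=1.0, trousers=0.5}, {sale=1.0}]
        System.out.println(weightedTermsByPosition(tokens));
    }
}
```

In a real implementation the boost would presumably travel as a token attribute set by the token filter, and the per-position map would be consumed where the QueryBuilder creates the SynonymQuery.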
> On Sun, Nov 18, 2018 at 2:24 AM Doug Turnbull <
> dturnb...@opensourceconnections.com> wrote:
>
>> Thanks Jim
>>
>> Yeah, now that I think about it - I agree that perhaps the simplest
>> option would be to create alternate query builders. I think there are a
>> couple of enhancements to the base class that would be nice, such as:
>> - Some additional token attributes passed to newSynonymQuery, such as the
>> type (was this a synonym or hyponym or something else...)
>> - The ability to differentiate between the original query term and the
>> generated synonym terms
>> - Consistent support for phrases
>>
>> I think part of my goal too is to help people without the use of plugins,
>> as we are often in scenarios at OpenSource Connections where people can't
>> use a plugin. In this case alternate expansions around
>> hypernyms/hyponyms/?... are a pretty frequent gap that search teams have
>> using Solr/Lucene/ES.
>>
>> So perhaps one way forward to contribute this sort of thing into Lucene
>> is we could implement additional QueryBuilder implementations that provide
>> such functionality?
>>
>> Thanks
>> -Doug
>>
>> On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi <jim.feren...@gmail.com>
>> wrote:
>>
>>> You can easily customize the query that is used for synonyms in a custom
>>> QueryBuilder. The javadoc of *newSynonymQuery* says "This is
>>> intended for subclasses that wish to customize the generated queries." so I
>>> don't think we need to do anything there. I agree that it is sometimes
>>> better to use something other than the SynonymQuery, but in the general
>>> case it works as expected and can be combined with other terms naturally.
>>> The kind of customization you want to achieve could be done in a plugin (or
>>> in Solr or ES) that extends the QueryBuilder; you can also use custom token
>>> filters and alter the query the way you want. My point here is that the
>>> QueryBuilder should remain simple; you can add the complexity you want in a
>>> subclass.
>>> However, I think there is another area we need to fix: the scoring of
>>> multi-term synonyms is broken (compared to the SynonymQuery) and could be
>>> improved, so we need something similar to the SynonymQuery that handles
>>> multiple phrases.
>>>
>>>
>>> On Sat, Nov 17, 2018 at 7:19 AM Doug Turnbull <
>>> dturnb...@opensourceconnections.com> wrote:
>>>
>>>> Yes, that is another good area (there are many), although of course
>>>> embeddings have their own challenges and complexities (they often capture
>>>> shared context, but not shared meaning).
>>>>
>>>> It's a data point for something we'd want to include in such a
>>>> framework, though I'm not sure where it would go on the roadmap...
>>>>
>>>> On Sat, Nov 17, 2018 at 1:15 AM J. Delgado <joaquin.delg...@gmail.com>
>>>> wrote:
>>>>
>>>>> What about the use of word embeddings (see
>>>>>
>>>>> https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
>>>>> to compute word similarity?
>>>>>
>>>>> On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <
>>>>> dturnb...@opensourceconnections.com> wrote:
>>>>>
>>>>>> Hey folks,
>>>>>>
>>>>>> I wanted to open up a discussion about a change to the usage of
>>>>>> SynonymQuery. The goal here is to have a broader library of queries that
>>>>>> can address other cases where related terms occupy the same position but
>>>>>> don't have the same meaning (such as hypernyms, hyponyms, meronyms,
>>>>>> ambiguous terms, and other query expansion situations).
>>>>>>
>>>>>>
>>>>>> I bring this up because we've noticed (as I'm sure many of you have)
>>>>>> the pattern of clients jamming any related term into a synonyms file and
>>>>>> being surprised by odd results. I like the idea of enforcing that
>>>>>> "synonyms" means exactly-the-same in Lucene-land. It's an easy thing to
>>>>>> tell a client and set up simple patterns. So for synonyms, I think leaving
>>>>>> SynonymQuery in place works great.
>>>>>>
>>>>>> But I feel if that's the rule, we need to open up discussion of other
>>>>>> methods of scoring the conceptual 'related term' relationships that usually
>>>>>> come up in the context of query expansion. This paper (
>>>>>> https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2,
>>>>>> surveys the current thinking on scoring various query expansion scenarios
>>>>>> like those we deal with in the messy, ambiguous uses of synonyms in prod
>>>>>> systems (khakis aren't trousers, they're a kind of trouser).
>>>>>>
>>>>>>
>>>>>> The cool thing is many of the ideas in this paper seem doable with
>>>>>> existing Lucene index stats. So one might imagine a 'related terms' token
>>>>>> filter that injected some scoring based on how related it really is
>>>>>> to the original query term using Jaccard, Dice, or other methods called out
>>>>>> in this paper.
>>>>>>
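For reference, a small sketch of the two coefficients named above, computed over the sets of document IDs that contain each term. The class and method names are hypothetical, and the document sets are made-up illustration data; in Lucene the same counts would have to come from index statistics or postings, which is not shown here. Jaccard is the intersection size divided by the union size, and Dice is twice the intersection size divided by the sum of the two set sizes.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not a Lucene class.
final class RelatedTermSimilarity {

    // Jaccard coefficient: |A and B| / |A or B|; defined as 0 when both sets are empty.
    static double jaccard(Set<Integer> docsA, Set<Integer> docsB) {
        Set<Integer> union = new HashSet<>(docsA);
        union.addAll(docsB);
        if (union.isEmpty()) {
            return 0.0;
        }
        return (double) intersectionSize(docsA, docsB) / union.size();
    }

    // Dice coefficient: 2 * |A and B| / (|A| + |B|); defined as 0 when both sets are empty.
    static double dice(Set<Integer> docsA, Set<Integer> docsB) {
        int total = docsA.size() + docsB.size();
        if (total == 0) {
            return 0.0;
        }
        return 2.0 * intersectionSize(docsA, docsB) / total;
    }

    // Number of documents that contain both terms.
    private static int intersectionSize(Set<Integer> docsA, Set<Integer> docsB) {
        Set<Integer> common = new HashSet<>(docsA);
        common.retainAll(docsB);
        return common.size();
    }

    public static void main(String[] args) {
        Set<Integer> khakis = Set.of(1, 2, 3);      // docs containing "khakis"
        Set<Integer> trousers = Set.of(2, 3, 4, 5); // docs containing "trousers"
        System.out.println(jaccard(khakis, trousers)); // 2 shared / 5 total = 0.4
        System.out.println(dice(khakis, trousers));    // 2*2 / (3+4) = 4/7
    }
}
```

Either value could serve as a weight for an injected related term, with 1.0 meaning the two terms always co-occur and 0.0 meaning they never do.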
>>>>>>
>>>>>> Another insightful set of research is this article on concept scoring
>>>>>> (
>>>>>> https://usabilityetc.com/articles/information-retrieval-concept-matching/
>>>>>> ), which prioritizes related terms by connectedness and other
>>>>>> factors.
>>>>>>
>>>>>> Needless to say, it's an open area how terms that someone has asserted
>>>>>> are related to a query term 'should be' scored. It's one of those things
>>>>>> that likely will forever depend on a number of domain- and application-
>>>>>> specific factors. It's possibly a big opportunity for improvement in Lucene
>>>>>> - but it is likely about putting the right framework in place to allow for
>>>>>> a good default set of query-expansion scoring scenarios with options for
>>>>>> customization.
>>>>>>
>>>>>> What I'm proposing is:
>>>>>>
>>>>>>
>>>>>>    - Submit a small patch that restricts SynonymQuery to tokens of
>>>>>>      type "SYNONYM" in the same position, which allows some short-term
>>>>>>      work to be done with the current Lucene QueryBuilder. Any additional
>>>>>>      non-synonym terms would be appended as a boolean query for now.
>>>>>>
>>>>>>    - Begin work on alternate 'related-term' scoring systems that also
>>>>>>      key off the token type in QueryBuilder to create custom scoring using
>>>>>>      built-in term stats. The possibilities here are endless, up to
>>>>>>      weighted related terms (i.e. Alessandro's patch), feeding back Rocchio
>>>>>>      relevance feedback, etc.
>>>>>>
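One possible reading of the first bullet, sketched in plain Java without Lucene. The `TypedToken` class, the "HYPERNYM" type, and the string output format are all hypothetical; the sketch also assumes the original query term carries Lucene's default token type "word" and that generated synonyms carry the type "SYNONYM". Tokens stacked at one position are split into a same-meaning group, which would become the SynonymQuery, and everything else, which would be appended as separate optional clauses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not a Lucene class.
final class TypedExpansion {

    // Stand-in for a token with its type attribute.
    static final class TypedToken {
        final String term;
        final String type;

        TypedToken(String term, String type) {
            this.term = term;
            this.type = type;
        }
    }

    // Describes the query for ONE position as a readable string.
    static String describe(List<TypedToken> stacked) {
        List<String> sameMeaning = new ArrayList<>();
        List<String> related = new ArrayList<>();
        for (TypedToken t : stacked) {
            // Assumption: "word" marks the original term, "SYNONYM" a true synonym.
            if ("word".equals(t.type) || "SYNONYM".equals(t.type)) {
                sameMeaning.add(t.term);
            } else {
                related.add(t.term);
            }
        }
        List<String> clauses = new ArrayList<>();
        if (!sameMeaning.isEmpty()) {
            clauses.add("Synonym(" + String.join(" ", sameMeaning) + ")");
        }
        clauses.addAll(related); // non-synonym terms become their own clauses
        return String.join(" OR ", clauses);
    }

    public static void main(String[] args) {
        List<TypedToken> stacked = List.of(
            new TypedToken("khakis", "word"),
            new TypedToken("chinos", "SYNONYM"),
            new TypedToken("trousers", "HYPERNYM"));
        // prints Synonym(khakis chinos) OR trousers
        System.out.println(describe(stacked));
    }
}
```

The second bullet would then replace the plain appended clauses with scoring that depends on the token type.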
>>>>>>
>>>>>> I'm curious what folks would think of a patch for bullet one followed
>>>>>> by other patches down the road for additional functionality?
>>>>>>
>>>>>> (related to discussion in this Elasticsearch PR
>>>>>>
>>>>>>
>>>>>> https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249
>>>>>> )
>>>>>>
>>>>>> --
>>>>>> CTO, OpenSource Connections
>>>>>> Author, Relevant Search
>>>>>> http://o19s.com/doug
>>>>>>
> --
Lucene/Solr Search Committer (PMC), Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com
