Great thoughts Jim - +1 to your idea

One thought I had: taxonomies have a kind of 'ideal scoring' that I
think would lead to a different blending strategy for taxonomies than
for synonyms.

If you have a taxonomy:

\shoes\dress_shoes\oxfords
\shoes\dress_shoes\wingtips
\shoes\lazy_shoes\loafers
\shoes\lazy_shoes\sketchers

This taxonomy states that if a document mentions 'oxfords', it's also
discussing the concept of dress shoes. If it only mentions 'wingtips', it
is also discussing dress shoes.

Thus, ideally, the true document frequency of the parent concept 'dress
shoes' is the number of documents that mention the parent or any of its
children. That is the number of documents that discuss this concept.

You can repeat this for grandparent concepts. The number of documents with
'shoes' really is all the documents mentioning oxfords, wingtips, loafers,
sketchers, and the like...
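
To make the roll-up concrete, here is a minimal Python sketch of that
'ideal' document frequency. The five-document corpus is made up for
illustration, and concept_doc_sets is a hypothetical helper, not anything
in Lucene. The point it tries to show is that a parent's df is the size of
the union of its descendants' documents (plus documents naming the parent
itself), not the sum of the children's dfs.

```python
from collections import defaultdict

# Taxonomy paths from the example above, root first.
PATHS = [
    ("shoes", "dress_shoes", "oxfords"),
    ("shoes", "dress_shoes", "wingtips"),
    ("shoes", "lazy_shoes", "loafers"),
    ("shoes", "lazy_shoes", "sketchers"),
]

# Hypothetical corpus: doc id -> terms the document literally mentions.
DOCS = {
    1: {"oxfords"},
    2: {"wingtips"},
    3: {"wingtips", "loafers"},
    4: {"sketchers"},
    5: {"dress_shoes"},
}


def concept_doc_sets(paths, docs):
    """Map each concept to the doc ids mentioning it or any descendant."""
    doc_sets = defaultdict(set)
    for path in paths:
        for depth, concept in enumerate(path):
            # The concept itself plus everything below it on this path.
            subtree_terms = set(path[depth:])
            for doc_id, terms in docs.items():
                if terms & subtree_terms:
                    doc_sets[concept].add(doc_id)
    return dict(doc_sets)


doc_sets = concept_doc_sets(PATHS, DOCS)
ideal_df = {concept: len(ids) for concept, ids in doc_sets.items()}

print(ideal_df["wingtips"])     # 2
print(ideal_df["dress_shoes"])  # 4
print(ideal_df["shoes"])        # 5 (doc 3 is counted once, not twice)
```

Document 3 mentions both wingtips and loafers, so summing the children
would overcount 'shoes'; taking the union avoids that.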

We have implemented this idea at index time, using index-time semantic
expansion to inject the parent concepts (manually putting dress_shoes into
documents that just mention wingtips). This is described in this blog post
https://opensourceconnections.com/blog/2016/12/23/elasticsearch-synonyms-patterns-taxonomies/
and this conference talk https://www.youtube.com/watch?v=90F30PS-884

This approach is annoying and requires reindexing, though it's the most
accurate.
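
As a rough illustration of the index-time idea (not the implementation
from the blog post or talk), the Python sketch below adds every ancestor
concept to a document's token list. PARENT_OF and expand_with_ancestors
are hypothetical names. A real analyzer would typically stack the injected
tokens at the same position as the original term; this sketch simply
appends them.

```python
# Taxonomy paths from the example above, root first.
PATHS = [
    ("shoes", "dress_shoes", "oxfords"),
    ("shoes", "dress_shoes", "wingtips"),
    ("shoes", "lazy_shoes", "loafers"),
    ("shoes", "lazy_shoes", "sketchers"),
]

# child -> parent lookup derived from the paths.
PARENT_OF = {}
for path in PATHS:
    for parent, child in zip(path[:-1], path[1:]):
        PARENT_OF[child] = parent


def expand_with_ancestors(tokens, parent_of):
    """Return tokens plus any ancestor concepts not already present."""
    expanded = list(tokens)
    seen = set(tokens)
    for token in tokens:
        node = parent_of.get(token)
        while node is not None:
            if node not in seen:
                seen.add(node)
                expanded.append(node)
            node = parent_of.get(node)
    return expanded


print(expand_with_ancestors(["brown", "wingtips"], PARENT_OF))
# ['brown', 'wingtips', 'dress_shoes', 'shoes']
```

With the parents written into the index this way, the stored df of
dress_shoes and shoes already reflects every document that discusses them.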

BUT I think a blended query-time approach would capture the same semantics.
You basically want to score a taxonomy like the following. Imagine a user
query of wingtips: you could use 3 SHOULD clauses that blend at
different levels:

- Search for the term 'wingtips' (lowest doc freq, smallest set)
- Search for parent & sibling concepts (the set of all dress shoes)
- Search for grandparent, aunt, uncle, cousins... (the set of all shoes,
highest df)

text:wingtips OR Blended(text:wingtips, text:oxfords, text:dress_shoes) OR
Blended(text:wingtips, text:oxfords, text:dress_shoes, text:sketchers,
text:loafers, ...)

Right now this can be accomplished by just issuing 3 SHOULD queries with 3
different query-time analyzers each with different synonym expansions
(exact user term, child => parent/sibling, child => parent, grandparent,
etc...). And maybe it should stay that way.
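
Here is a small Python sketch of how those three expansion levels could be
derived from the taxonomy. It only builds the term lists and prints them
in the pseudo-syntax used above; 'Blended(...)' is my shorthand from this
email, not a real query parser syntax, and tiers/render are hypothetical
helper names.

```python
# Taxonomy paths from the example above, root first.
PATHS = [
    ("shoes", "dress_shoes", "oxfords"),
    ("shoes", "dress_shoes", "wingtips"),
    ("shoes", "lazy_shoes", "loafers"),
    ("shoes", "lazy_shoes", "sketchers"),
]


def tiers(term, paths):
    """Return expansion term lists for `term`, narrowest set first."""
    path = next(p for p in paths if term in p)
    result = [[term]]
    # Walk up the tree: each ancestor adds itself and its whole subtree.
    for depth in range(path.index(term) - 1, -1, -1):
        prefix = path[: depth + 1]
        members = []
        for p in paths:
            if p[: depth + 1] == prefix:
                for t in p[depth:]:
                    if t not in members:
                        members.append(t)
        # Stable sort that moves the user's own term to the front.
        members.sort(key=lambda t: t != term)
        result.append(members)
    return result


def render(term, paths, field="text"):
    """Render the tiers in the pseudo-syntax used in this email."""
    clauses = []
    for members in tiers(term, paths):
        terms = ", ".join(f"{field}:{t}" for t in members)
        clauses.append(terms if len(members) == 1 else f"Blended({terms})")
    return " OR ".join(clauses)


for level in tiers("wingtips", PATHS):
    print(level)
print(render("wingtips", PATHS))
```

Each successive tier is a superset of the previous one, so its blended df
grows and its contribution to the score shrinks, which is the ordering
described in the bullets above.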

But this is why I think it's a 'yes, AND': yes, I think it would be a great
addition to have synonym weighting, AND I think there are blending
strategies that are specific to the use case.

-Doug



On Tue, Nov 20, 2018 at 9:34 PM Michael Sokolov <msoko...@gmail.com> wrote:

> This is a great idea. It would also be compelling to modify the term
> frequency using this deboosting so that stacked indexed terms can be
> weighted according to their closeness to the original term.
>
> On Tue, Nov 20, 2018, 2:19 PM jim ferenczi <jim.feren...@gmail.com> wrote:
>
>> Sorry for the late reply,
>>
>> > So perhaps one way forward to contribute this sort of thing into Lucene
>> > is we could implement additional QueryBuilder implementations that provide
>> > such functionality?
>>
>> I am not sure. I mentioned Solr and ES because I thought it was about
>> adding taxonomies and complex expansion mechanisms to query builders, but I
>> wonder if we can have a simple mechanism to just (de)boost stacked tokens
>> in the QueryBuilder. It could be a new attribute that token filters would
>> use when they produce stacked tokens and that the QueryBuilder checks when
>> it builds the SynonymQuery. We already have a TermFrequencyAttribute to
>> alter the frequency of a term when indexing, so we could have the same
>> mechanism for query term boosting?
>>
>> On Sun, Nov 18, 2018 at 2:24 AM Doug Turnbull <
>> dturnb...@opensourceconnections.com> wrote:
>>
>>> Thanks Jim
>>>
>>> Yeah, now that I think about it - I agree that perhaps the simplest
>>> option would be to create alternate query builders. I think there are a
>>> couple of enhancements to the base class that would be nice, such as
>>> - Some additional token attributes passed to newSynonymQuery, such as
>>> the type (was this a synonym or hyponym or something else...)
>>> - The ability to differentiate between the original query term and the
>>> generated synonym terms
>>> - Consistent support for phrases
>>>
>>> I think part of my goal too is to help people without the use of
>>> plugins, as at OpenSource Connections we are often in scenarios where
>>> people won't be able to use a plugin. In this case, alternate expansions
>>> around hypernyms/hyponyms/?... are a pretty frequent gap that search teams
>>> have using Solr/Lucene/ES.
>>>
>>> So perhaps one way forward to contribute this sort of thing into Lucene
>>> is we could implement additional QueryBuilder implementations that provide
>>> such functionality?
>>>
>>> Thanks
>>> -Doug
>>>
>>>
>>> On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi <jim.feren...@gmail.com>
>>> wrote:
>>>
>>>> You can easily customize the query that is used for synonyms in a
>>>> custom QueryBuilder. The javadocs of *newSynonymQuery* say "This
>>>> is intended for subclasses that wish to customize the generated queries."
>>>> so I don't think we need to do anything there. I agree that it is sometimes
>>>> better to use something other than the SynonymQuery, but in the general
>>>> case it works as expected and can be combined with other terms naturally.
>>>> The kind of customization you want to achieve could be done in a plugin (or
>>>> in Solr or ES) that extends the QueryBuilder; you can also use custom token
>>>> filters and alter the query the way you want. My point here is that the
>>>> QueryBuilder should remain simple; you can add the complexity you want in a
>>>> subclass.
>>>> However, I think there is another area we need to fix: the scoring of
>>>> multi-term synonyms is broken (compared to the SynonymQuery) and could be
>>>> improved, so we need something similar to the SynonymQuery that handles
>>>> multi-term phrases.
>>>>
>>>>
>>>> On Sat, Nov 17, 2018 at 7:19 AM Doug Turnbull <
>>>> dturnb...@opensourceconnections.com> wrote:
>>>>
>>>>> Yes, that is another good area (there are many), although of course
>>>>> embeddings have their own challenges and complexities (they often capture
>>>>> shared context, but not shared meaning).
>>>>>
>>>>> It's a data point for something we'd want to include in such a
>>>>> framework, though I'm not sure where it would go on the roadmap...
>>>>>
>>>>> On Sat, Nov 17, 2018 at 1:15 AM J. Delgado <joaquin.delg...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> What about the use of word embeddings (see
>>>>>>
>>>>>> https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
>>>>>> to compute word similarity?
>>>>>>
>>>>>> On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <
>>>>>> dturnb...@opensourceconnections.com> wrote:
>>>>>>
>>>>>>> Hey folks,
>>>>>>>
>>>>>>> I wanted to open up a discussion about a change to the usage of
>>>>>>> SynonymQuery. The goal here is to have a broader library of queries that
>>>>>>> can address other cases where related terms occupy the same position but
>>>>>>> don't have the same meaning (such as hypernyms, hyponyms, meronyms,
>>>>>>> ambiguous terms, and other query expansion situations).
>>>>>>>
>>>>>>>
>>>>>>> I bring this up because we've noticed (as I'm sure many of you have)
>>>>>>> the pattern of clients jamming any related term into a synonyms file and
>>>>>>> being surprised with odd results. I like the idea of enforcing that
>>>>>>> "synonyms" means exactly-the-same in Lucene-land. It's an easy thing to
>>>>>>> tell a client and to set up simple patterns around. So for synonyms, I
>>>>>>> think leaving SynonymQuery in place works great.
>>>>>>>
>>>>>>> But I feel that if that's the rule, we need to open up discussion of
>>>>>>> other methods of scoring the conceptual 'related term' relationships that
>>>>>>> usually come up in the context of query expansion. This paper (
>>>>>>> https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2,
>>>>>>> surveys the current thinking for scoring various query expansion
>>>>>>> scenarios like those we deal with in the messy, ambiguous uses of
>>>>>>> synonyms in prod systems (khakis aren't trousers, they're a kind-of
>>>>>>> trouser).
>>>>>>>
>>>>>>>
>>>>>>> The cool thing is many of the ideas in this paper seem doable with
>>>>>>> existing Lucene index stats. So one might imagine a 'related terms'
>>>>>>> token filter that injected some scoring based on how related it really
>>>>>>> is to the original query term using Jaccard, Dice, or other methods
>>>>>>> called out in this paper.
>>>>>>>
>>>>>>>
>>>>>>> Another insightful set of research is this article on concept
>>>>>>> scoring (
>>>>>>> https://usabilityetc.com/articles/information-retrieval-concept-matching/
>>>>>>> ), which prioritizes related terms by connectedness and other
>>>>>>> factors.
>>>>>>>
>>>>>>> Needless to say, it's an open question how two terms someone has
>>>>>>> asserted are related to a query term 'should be' scored. It's one of
>>>>>>> those things that will likely forever depend on a number of domain- and
>>>>>>> application-specific factors. It's possibly a big opportunity for
>>>>>>> improvement in Lucene, but it is likely about putting the right
>>>>>>> framework in place to allow for a good default set of query-expansion
>>>>>>> scoring scenarios with options for customization.
>>>>>>>
>>>>>>> What I'm proposing is:
>>>>>>>
>>>>>>>
>>>>>>>    - Submit a small patch that restricts SynonymQuery to tokens of
>>>>>>>      type "SYNONYM" in the same position, which allows some short-term
>>>>>>>      work to be done with the current Lucene QueryBuilder. Any additional
>>>>>>>      non-synonym terms would be appended as a boolean query for now.
>>>>>>>
>>>>>>>    - Begin work on alternate 'related-term' scoring systems that also
>>>>>>>      key off the token type in QueryBuilder to create custom scoring using
>>>>>>>      built-in term stats. The possibilities here are endless, up to
>>>>>>>      weighted related terms (i.e. Alessandro's patch), feeding back Rocchio
>>>>>>>      relevance feedback, etc.
>>>>>>>
>>>>>>>
>>>>>>> I'm curious what folks would think of a patch for bullet one
>>>>>>> followed by other patches down the road for additional functionality?
>>>>>>>
>>>>>>> (related to discussion in this Elasticsearch PR:
>>>>>>> https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249
>>>>>>> )
>>>>>>>
>>>>>>> --
>>>>>>> CTO, OpenSource Connections
>>>>>>> Author, Relevant Search
>>>>>>> http://o19s.com/doug
>>>>>>>
