+1 great idea Jim!

On Tue, Nov 20, 2018 at 2:19 PM jim ferenczi <jim.feren...@gmail.com> wrote:
> Sorry for the late reply,
>
> > So perhaps one way forward to contribute this sort of thing into Lucene
> > is we could implement additional QueryBuilder implementations that provide
> > such functionality?
>
> I am not sure. I mentioned Solr and ES because I thought it was about
> adding taxonomies and complex expansion mechanisms to query builders, but I
> wonder if we can have a simple mechanism to just (de)boost stacked tokens
> in the QueryBuilder. It could be a new attribute that token filters would
> use when they produce stacked tokens and that the QueryBuilder checks when
> it builds the SynonymQuery. We already have a TermFrequencyAttribute to
> alter the frequency of a term when indexing, so we could have the same
> mechanism for query term boosting?
>
> On Sun, Nov 18, 2018 at 02:24, Doug Turnbull
> <dturnb...@opensourceconnections.com> wrote:
>
>> Thanks Jim
>>
>> Yeah, now that I think about it - I agree that perhaps the simplest
>> option would be to create alternate query builders. I think there are a
>> couple of enhancements to the base class that would be nice, such as
>> - Some additional token attributes passed to newSynonymQuery, such as the
>>   type (was this a synonym or hyponym or something else...)
>> - The ability to differentiate between the original query term and the
>>   generated synonym terms
>> - Consistent support for phrases
>>
>> I think part of my goal too is to help people without the use of plugins,
>> as we are often in scenarios at OpenSource Connections where people won't
>> be able to use a plugin. In this case alternate expansions around
>> hypernyms/hyponyms/?... are a pretty frequent gap that search teams have
>> using Solr/Lucene/ES.
>>
>> So perhaps one way forward to contribute this sort of thing into Lucene
>> is we could implement additional QueryBuilder implementations that provide
>> such functionality?
>>
>> Thanks
>> -Doug
>>
>> On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi <jim.feren...@gmail.com>
>> wrote:
>>
>>> You can easily customize the query that is used for synonyms in a custom
>>> QueryBuilder. The javadoc of *newSynonymQuery* says "This is
>>> intended for subclasses that wish to customize the generated queries." so I
>>> don't think we need to do anything there. I agree that it is sometimes
>>> better to use something other than the SynonymQuery, but in the general
>>> case it works as expected and can be combined with other terms naturally.
>>> The kind of customization you want to achieve could be done in a plugin (or
>>> in Solr or ES) that extends the QueryBuilder; you can also use custom token
>>> filters and alter the query the way you want. My point here is that the
>>> QueryBuilder should remain simple; you can add the complexity you want in a
>>> subclass.
>>> However, I think there is another area we need to fix: the scoring of
>>> multi-term synonyms is broken (compared to the SynonymQuery) and could be
>>> improved, so we need something similar to the SynonymQuery that handles
>>> multi-term phrases.
>>>
>>>
>>> On Sat, Nov 17, 2018 at 07:19, Doug Turnbull
>>> <dturnb...@opensourceconnections.com> wrote:
>>>
>>>> Yes that is another good area (there are many). Although of course
>>>> embeddings have their own challenges and complexities (they often capture
>>>> shared context, but not shared meaning).
>>>>
>>>> It's a data point though of something we'd want to include in such a
>>>> framework, though not sure where it would go on the roadmap...
>>>>
>>>> On Sat, Nov 17, 2018 at 1:15 AM J. Delgado <joaquin.delg...@gmail.com>
>>>> wrote:
>>>>
>>>>> What about the use of word embeddings (see
>>>>> https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
>>>>> to compute word similarity?
>>>>>
>>>>> On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull
>>>>> <dturnb...@opensourceconnections.com> wrote:
>>>>>
>>>>>> Hey folks,
>>>>>>
>>>>>> I wanted to open up a discussion about a change to the usage of
>>>>>> SynonymQuery. The goal here is to have a broader library of queries that
>>>>>> can address other cases where related terms occupy the same position but
>>>>>> don't have the same meaning (such as hypernyms, hyponyms, meronyms,
>>>>>> ambiguous terms, and other query expansion situations).
>>>>>>
>>>>>> I bring this up because we've noticed (as I'm sure many of you have)
>>>>>> the pattern of clients jamming any related term into a synonyms file and
>>>>>> being surprised with odd results. I like the idea of enforcing that
>>>>>> "synonyms" means exactly-the-same in Lucene-land. It's an easy thing to
>>>>>> tell a client and set up simple patterns. So for synonyms, I think
>>>>>> leaving SynonymQuery in place works great.
>>>>>>
>>>>>> But I feel if that's the rule, we need to open up discussion of other
>>>>>> methods of scoring conceptual 'related term' relationships that usually
>>>>>> come up in the context of query expansion. This paper
>>>>>> (https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2,
>>>>>> surveys the current thinking for scoring various query expansion
>>>>>> scenarios like those we deal with in the messy, ambiguous uses of
>>>>>> synonyms in prod systems (khakis aren't trousers, they're a kind-of
>>>>>> trouser).
>>>>>>
>>>>>> The cool thing is many of the ideas in this paper seem doable with
>>>>>> existing Lucene index stats. So one might imagine a 'related terms' token
>>>>>> filter that injected some scoring based on how related it really is
>>>>>> to the original query term using Jaccard, Dice, or other methods called
>>>>>> out in this paper.
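To make the Jaccard and Dice idea mentioned above a bit more concrete, here is a minimal sketch in plain Java. It assumes each term's postings are available as a set of document ids; the class name RelatedTermScores and its methods are hypothetical and are not part of Lucene. How such counts would actually be obtained from index statistics is left open, as it is in the message.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper (not a Lucene class): relatedness of two terms from their doc-id sets. */
final class RelatedTermScores {
    private RelatedTermScores() {}

    /** Number of documents that contain both terms. */
    static int overlap(Set<Integer> docsA, Set<Integer> docsB) {
        Set<Integer> common = new HashSet<>(docsA);
        common.retainAll(docsB);
        return common.size();
    }

    /** Jaccard coefficient: |A and B| / |A or B|; 0 when both sets are empty. */
    static double jaccard(Set<Integer> docsA, Set<Integer> docsB) {
        int both = overlap(docsA, docsB);
        int either = docsA.size() + docsB.size() - both;
        return either == 0 ? 0.0 : (double) both / either;
    }

    /** Dice coefficient: 2 * |A and B| / (|A| + |B|); 0 when both sets are empty. */
    static double dice(Set<Integer> docsA, Set<Integer> docsB) {
        int total = docsA.size() + docsB.size();
        return total == 0 ? 0.0 : 2.0 * overlap(docsA, docsB) / total;
    }
}
```

With made-up postings of {1, 2, 3, 4} for "khakis" and {3, 4, 5, 6} for "trousers", jaccard gives 2/6 and dice gives 4/8. A score like this could serve as the weight of the related term relative to the original query term, though how best to fold it into scoring is exactly the open question raised in this thread.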
>>>>>>
>>>>>> Another insightful piece of research is this article on concept scoring
>>>>>> (https://usabilityetc.com/articles/information-retrieval-concept-matching/),
>>>>>> which prioritizes related terms by connectedness and other factors.
>>>>>>
>>>>>> Needless to say, it's an open area how two terms someone has asserted
>>>>>> are related to a query term 'should be' scored. It's one of those things
>>>>>> that likely will forever depend on a number of domain- and
>>>>>> application-specific factors. It's possibly a big opportunity for
>>>>>> improvement in Lucene - but it is likely about putting the right
>>>>>> framework in place to allow for a good default set of query-expansion
>>>>>> scoring scenarios with options for customization.
>>>>>>
>>>>>> What I'm proposing is:
>>>>>>
>>>>>> - Submit a small patch that restricts SynonymQuery to tokens of
>>>>>>   type "SYNONYM" in the same position, which allows some short-term work
>>>>>>   to be done with the current Lucene QueryBuilder. Any additional
>>>>>>   non-synonym terms would be appended as a boolean query for now.
>>>>>> - Begin work on alternate 'related-term' scoring systems that also
>>>>>>   key off the token type in QueryBuilder to create custom scoring using
>>>>>>   built-in term stats. The possibilities here are endless, up to
>>>>>>   weighted related terms (i.e. Alessandro's patch), feeding back Rocchio
>>>>>>   relevance feedback, etc.
>>>>>>
>>>>>> I'm curious what folks would think of a patch for bullet one followed
>>>>>> by other patches down the road for additional functionality?
>>>>>>
>>>>>> (related to discussion in this Elasticsearch PR
>>>>>> https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249)
>>>>>>
>>>>>> --
>>>>>> CTO, OpenSource Connections
>>>>>> Author, Relevant Search
>>>>>> http://o19s.com/doug
>>>>>>
>
--
Lucene/Solr Search Committer (PMC), Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
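
Jim's suggestion at the top of this thread (a boost attribute that token filters set on stacked tokens and that the query builder reads when it builds the synonym clause) might look roughly like the following toy model. This is only a sketch under stated assumptions: it does not use Lucene's classes, and Token, its boost field, groupByPosition and describe are all hypothetical names invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not Lucene code) of a per-token boost that a query builder could honor. */
final class StackedTokenBoostSketch {
    private StackedTokenBoostSketch() {}

    /** A token as an analysis chain might emit it; boost stands in for the proposed attribute. */
    static final class Token {
        final String term;
        final int positionIncrement; // 0 means "stacked on the previous token"
        final float boost;           // 1.0 means "no (de)boost"

        Token(String term, int positionIncrement, float boost) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.boost = boost;
        }
    }

    /** Groups tokens that share a position, the way stacked synonyms do. */
    static List<List<Token>> groupByPosition(List<Token> tokens) {
        List<List<Token>> positions = new ArrayList<>();
        for (Token token : tokens) {
            if (token.positionIncrement > 0 || positions.isEmpty()) {
                positions.add(new ArrayList<>()); // a new position starts here
            }
            positions.get(positions.size() - 1).add(token);
        }
        return positions;
    }

    /** Renders one synonym clause per position, carrying each token's boost along. */
    static String describe(List<Token> tokens) {
        StringBuilder out = new StringBuilder();
        for (List<Token> position : groupByPosition(tokens)) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append("Synonym(");
            for (int i = 0; i < position.size(); i++) {
                Token token = position.get(i);
                if (i > 0) {
                    out.append(' ');
                }
                out.append(token.term);
                if (token.boost != 1.0f) {
                    out.append('^').append(token.boost);
                }
            }
            out.append(')');
        }
        return out.toString();
    }
}
```

For the tokens khakis (increment 1, boost 1.0), trousers (increment 0, boost 0.5) and sale (increment 1, boost 1.0), describe returns "Synonym(khakis trousers^0.5) Synonym(sale)". The point of the design is that the token filter decides how much a stacked term should count, while the query builder stays simple and only passes the boost through.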