Sorry for the late reply,

> So perhaps one way forward to contribute this sort of thing into Lucene is
> we could implement additional QueryBuilder implementations that provide
> such functionality?

I am not sure. I mentioned Solr and ES because I thought it was about adding
taxonomies and complex expansion mechanisms to query builders, but I wonder
if we can have a simple mechanism to just (de)boost stacked tokens in the
QueryBuilder. It could be a new attribute that token filters would set when
they produce stacked tokens and that the QueryBuilder checks when it builds
the SynonymQuery. We already have a TermFrequencyAttribute to alter the
frequency of a term when indexing, so we could have the same mechanism for
query term boosting?

On Sun, Nov 18, 2018 at 02:24, Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> Thanks Jim
>
> Yeah, now that I think about it - I agree that perhaps the simplest option
> would be to create alternate query builders. I think there's a couple of
> enhancements to the base class that would be nice, such as
> - Some additional token attributes passed to newSynonymQuery, such as the
>   type (was this a synonym or hyponym or something else...)
> - The ability to differentiate between the original query term and the
>   generated synonym terms
> - Consistent support for phrases
>
> I think part of my goal too is to help people without the use of plugins,
> as we are often in scenarios at OpenSource Connections where people won't
> be able to use a plugin. In this case alternate expansions around
> hypernyms/hyponyms/?... are a pretty frequent gap that search teams have
> using Solr/Lucene/ES.
>
> So perhaps one way forward to contribute this sort of thing into Lucene is
> we could implement additional QueryBuilder implementations that provide
> such functionality?
>
> Thanks
> -Doug
>
> On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi <jim.feren...@gmail.com>
> wrote:
>
>> You can easily customize the query that is used for synonyms in a custom
>> QueryBuilder. The javadoc of *newSynonymQuery* says "This is
>> intended for subclasses that wish to customize the generated queries."
>> so I don't think we need to do anything there. I agree that it is sometimes
>> better to use something different than the SynonymQuery, but in the general
>> case it works as expected and can be combined with other terms naturally.
>> The kind of customization you want to achieve could be done in a plugin (or
>> in Solr or ES) that extends the QueryBuilder; you can also use custom token
>> filters and alter the query the way you want. My point here is that the
>> QueryBuilder should remain simple; you can add the complexity you want in a
>> subclass.
>> However, I think there is another area we need to fix: the scoring of
>> multi-term synonyms is broken (compared to the SynonymQuery) and could be
>> improved, so we need something similar to the SynonymQuery that handles
>> multi phrases.
>>
>>
>> On Sat, Nov 17, 2018 at 07:19, Doug Turnbull <
>> dturnb...@opensourceconnections.com> wrote:
>>
>>> Yes that is another good area (there are many). Although of course
>>> embeddings have their own challenges and complexities (they often capture
>>> shared context, but not shared meaning).
>>>
>>> It's a data point though of something we'd want to include in such a
>>> framework, though not sure where it would go on the roadmap...
>>>
>>> On Sat, Nov 17, 2018 at 1:15 AM J. Delgado <joaquin.delg...@gmail.com>
>>> wrote:
>>>
>>>> What about the use of word embeddings (see
>>>>
>>>> https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
>>>> to compute word similarity?
>>>>
>>>> On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <
>>>> dturnb...@opensourceconnections.com> wrote:
>>>>
>>>>> Hey folks,
>>>>>
>>>>> I wanted to open up a discussion about a change to the usage of
>>>>> SynonymQuery.
>>>>> The goal here is to have a broader library of queries that
>>>>> can address other cases where related terms occupy the same position but
>>>>> don't have the same meaning (such as hypernyms, hyponyms, meronyms,
>>>>> ambiguous terms, and other query expansion situations).
>>>>>
>>>>> I bring this up because we've noticed (as I'm sure many of you have)
>>>>> the pattern of clients jamming any related term into a synonyms file and
>>>>> being surprised with odd results. I like the idea of enforcing "synonyms"
>>>>> means exactly-the-same in Lucene-land. It's an easy thing to tell a client
>>>>> and set up simple patterns. So for synonyms, I think leaving SynonymQuery
>>>>> in place works great.
>>>>>
>>>>> But I feel if that's the rule, we need to open up discussion of other
>>>>> methods of scoring conceptual 'related term' relationships that usually
>>>>> come up in the context of query expansion. This paper (
>>>>> https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2,
>>>>> surveys the current thinking for scoring various query expansion scenarios
>>>>> like those we deal with in the messy, ambiguous uses of synonyms in prod
>>>>> systems (khakis aren't trousers, they're a kind-of trouser).
>>>>>
>>>>> The cool thing is many of the ideas in this paper seem doable with
>>>>> existing Lucene index stats. So one might imagine a 'related terms' token
>>>>> filter that injected some scoring based on how related it really is
>>>>> to the original query term using Jaccard, Dice, or other methods called
>>>>> out in this paper.
>>>>>
>>>>> Another insightful set of research is this article on concept scoring (
>>>>> https://usabilityetc.com/articles/information-retrieval-concept-matching/
>>>>> ), which prioritizes related terms by connectedness and other factors.
>>>>>
>>>>> Needless to say, it's an open area how two terms someone has asserted
>>>>> are related to a query term 'should be' scored.
>>>>> It's one of those things
>>>>> that likely will forever depend on a number of domain and application
>>>>> specific factors. It's possibly a big opportunity for improvement for
>>>>> Lucene - but likely is about putting the right framework in place to
>>>>> allow for a good default set of query-expansion scoring scenarios with
>>>>> options for customization.
>>>>>
>>>>> What I'm proposing is:
>>>>>
>>>>> - Submit a small patch that restricts SynonymQuery to tokens of type
>>>>>   "SYNONYM" in the same position, which allows some short term work to
>>>>>   be done with the current Lucene QueryBuilder. Any additional
>>>>>   non-synonym terms would be appended as a boolean query for now
>>>>> - Begin work on alternate 'related-term' scoring systems that also
>>>>>   key off the token type in QueryBuilder to create custom scoring using
>>>>>   built-in term stats. The possibilities here are endless, up to weighted
>>>>>   related terms (i.e. Alessandro's patch), feeding back Rocchio relevance
>>>>>   feedback, etc.
>>>>>
>>>>> I'm curious what folks would think of a patch for bullet one followed
>>>>> by other patches down the road for additional functionality?
>>>>>
>>>>> (related to discussion in this Elasticsearch PR
>>>>> https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249
>>>>> )
>>>>>
>>>>> --
>>>>> CTO, OpenSource Connections
>>>>> Author, Relevant Search
>>>>> http://o19s.com/doug
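
For readers of the thread, here is a minimal sketch of the two relatedness measures named in the original proposal (Jaccard and Dice), computed from the sets of documents that contain each term. Everything in it is an assumption made for illustration: the class name RelatedTermScores, the use of plain doc-id sets, and the sample ids are hypothetical and are not Lucene APIs or anything proposed in the messages above. It only shows that both scores need nothing more than per-term document sets (or their sizes and overlap).

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical helper, not part of Lucene: scores how related two terms are
 * from the sets of document ids in which each term occurs.
 */
final class RelatedTermScores {
    private RelatedTermScores() {}

    /** Number of documents that contain both terms. */
    static int overlap(Set<Integer> docsA, Set<Integer> docsB) {
        Set<Integer> common = new HashSet<>(docsA);
        common.retainAll(docsB);
        return common.size();
    }

    /** Jaccard coefficient: shared docs divided by docs containing either term. */
    static double jaccard(Set<Integer> docsA, Set<Integer> docsB) {
        int both = overlap(docsA, docsB);
        int either = docsA.size() + docsB.size() - both;
        return either == 0 ? 0.0 : (double) both / either;
    }

    /** Dice coefficient: twice the shared docs divided by the sum of both set sizes. */
    static double dice(Set<Integer> docsA, Set<Integer> docsB) {
        int total = docsA.size() + docsB.size();
        return total == 0 ? 0.0 : 2.0 * overlap(docsA, docsB) / total;
    }

    public static void main(String[] args) {
        // Made-up doc ids: the two terms share 2 of the 6 distinct documents.
        Set<Integer> khakis = Set.of(1, 2, 3, 4);
        Set<Integer> trousers = Set.of(3, 4, 5, 6);
        System.out.println("jaccard = " + jaccard(khakis, trousers));
        System.out.println("dice    = " + dice(khakis, trousers));
    }
}
```

A score like this could in principle be used as the per-term weight for a stacked token, so that a loosely related term contributes less than an exact synonym; how that weight would reach the query is exactly the open question discussed above.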