My proposal was to tweak the boosting directly in the token filters through a single Attribute, but if we feel that this is too much to add to the analysis chain, I agree that we don't need to add any API. If you rely on abstract attributes (type, ...), then it should be easy to subclass the query builder to access them and implement the logic you want there.
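As a rough illustration of the subclassing route, here is a minimal sketch of the decision such a subclass could make for the tokens stacked at one position. It deliberately does not use Lucene classes: `StackedToken`, `TypeAwareQuerySketch`, and the "HYPONYM" label are hypothetical names, and the "word" and "SYNONYM" type strings are assumptions about what the analysis chain emits. In a real QueryBuilder subclass the type would be read from the token stream's type attribute and real queries would be built instead of a string.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one token stacked at a query position.
// In Lucene the type would come from the token stream's type attribute.
final class StackedToken {
    final String term;
    final String type;

    StackedToken(String term, String type) {
        this.term = term;
        this.type = type;
    }
}

// Hypothetical helper, not part of Lucene: only the original term and true
// synonyms share one synonym group; every other stacked token becomes a
// separate optional clause.
final class TypeAwareQuerySketch {
    static final String WORD = "word";       // assumed type of the original token
    static final String SYNONYM = "SYNONYM"; // assumed type of true synonyms

    // Returns a readable description of the query, not a real Query object.
    static String describe(List<StackedToken> stacked) {
        List<String> exact = new ArrayList<>();
        List<String> related = new ArrayList<>();
        for (StackedToken t : stacked) {
            if (WORD.equals(t.type) || SYNONYM.equals(t.type)) {
                exact.add(t.term);
            } else {
                related.add(t.term);
            }
        }
        List<String> clauses = new ArrayList<>();
        if (exact.size() == 1) {
            clauses.add(exact.get(0));
        } else if (exact.size() > 1) {
            clauses.add("Synonym(" + String.join(" ", exact) + ")");
        }
        clauses.addAll(related);
        return String.join(" OR ", clauses);
    }
}
```

For the stacked tokens trousers (word), pants (SYNONYM), and khakis (HYPONYM), `describe` returns `Synonym(trousers pants) OR khakis`. Where the related terms end up (a boolean clause, a boosted clause, or something else) is exactly the part a subclass would be free to change.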
On Thu, Nov 22, 2018 at 1:18 PM, Robert Muir <rcm...@gmail.com> wrote:

> There is already analyzeBoolean/analyzeMultiBoolean there that you can
> use for this. You can look at any attribute on the tokenstream you
> want. I don't see any need to add any more API.
>
> On 11/21/18, Doug Turnbull <dturnb...@opensourceconnections.com> wrote:
> > I agree there is a tension between analysis and query parser
> > responsibilities (or external to how queries are constructed). I wonder
> > what you'd think of making QueryBuilder more easily subclassable by
> > passing more term metadata to newSynonymQuery (such as types etc). This
> > would let you select an alt strategy (such as some of the scoring
> > systems used in the query expansion paper
> > https://arxiv.org/pdf/1708.00247.pdf). Or doing something with a term
> > labeled a hyponym/hypernym in a QueryBuilder subclass..
> >
> > -Doug
> >
> > On Wed, Nov 21, 2018 at 8:09 AM Robert Muir <rcm...@gmail.com> wrote:
> >
> >> I don't think we should put scoring stuff into the analysis chain like
> >> this. It already has a laundry list of responsibilities.
> >>
> >> Analysis chain can tell you the term is stacked or it's a certain type
> >> or occurs a certain number of times, but it shouldn't be supplying
> >> things such as floating point boosts. That kind of scoring
> >> manipulation needs to really happen in query parsing/somewhere else.
> >>
> >> On 11/20/18, jim ferenczi <jim.feren...@gmail.com> wrote:
> >> > Sorry for the late reply,
> >> >
> >> >> So perhaps one way forward to contribute this sort of thing into
> >> >> Lucene is we could implement additional QueryBuilder implementations
> >> >> that provide such functionality?
> >> >
> >> > I am not sure. I mentioned Solr and ES because I thought it was about
> >> > adding taxonomies and complex expansion mechanisms to query builders,
> >> > but I wonder if we can have a simple mechanism to just (de)boost
> >> > stacked tokens in the QueryBuilder. It could be a new attribute that
> >> > token filters would use when they produce stacked tokens and that the
> >> > QueryBuilder checks when it builds the SynonymQuery. We already have a
> >> > TermFrequencyAttribute to alter the frequency of a term when indexing,
> >> > so we could have the same mechanism for query term boosting?
> >> >
> >> > On Sun, Nov 18, 2018 at 2:24 AM, Doug Turnbull
> >> > <dturnb...@opensourceconnections.com> wrote:
> >> >
> >> >> Thanks Jim
> >> >>
> >> >> Yeah, now that I think about it - I agree that perhaps the simplest
> >> >> option would be to create alternate query builders. I think there are
> >> >> a couple of enhancements to the base class that would be nice, such as
> >> >> - Some additional token attributes passed to newSynonymQuery, such as
> >> >>   the type (was this a synonym or hyponym or something else...)
> >> >> - The ability to differentiate between the original query term and
> >> >>   the generated synonym terms
> >> >> - Consistent support for phrases
> >> >>
> >> >> I think part of my goal too is to help people without the use of
> >> >> plugins, as we are often in scenarios at OpenSource Connections where
> >> >> people won't be able to use a plugin. In this case alternate
> >> >> expansions around hypernyms/hyponyms/?... are a pretty frequent gap
> >> >> that search teams have using Solr/Lucene/ES.
> >> >>
> >> >> So perhaps one way forward to contribute this sort of thing into
> >> >> Lucene is we could implement additional QueryBuilder implementations
> >> >> that provide such functionality?
> >> >>
> >> >> Thanks
> >> >> -Doug
> >> >>
> >> >> On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi <jim.feren...@gmail.com>
> >> >> wrote:
> >> >>
> >> >>> You can easily customize the query that is used for synonyms in a
> >> >>> custom QueryBuilder. The javadocs of *newSynonymQuery* say "This is
> >> >>> intended for subclasses that wish to customize the generated
> >> >>> queries." so I don't think we need to do anything there. I agree that
> >> >>> it is sometimes better to use something different than the
> >> >>> SynonymQuery, but in the general case it works as expected and can be
> >> >>> combined with other terms naturally. The kind of customization you
> >> >>> want to achieve could be done in a plugin (or in Solr or ES) that
> >> >>> extends the QueryBuilder; you can also use custom token filters and
> >> >>> alter the query the way you want. My point here is that the
> >> >>> QueryBuilder should remain simple; you can add the complexity you
> >> >>> want in a subclass.
> >> >>> However I think there is another area we need to fix: the scoring of
> >> >>> multi-term synonyms is broken (compared to the SynonymQuery) and
> >> >>> could be improved, so we need something similar to the SynonymQuery
> >> >>> that handles multi phrases.
> >> >>>
> >> >>> On Sat, Nov 17, 2018 at 7:19 AM, Doug Turnbull
> >> >>> <dturnb...@opensourceconnections.com> wrote:
> >> >>>
> >> >>>> Yes that is another good area (there are many). Although of course
> >> >>>> embeddings have their own challenges and complexities (they often
> >> >>>> capture shared context, but not shared meaning).
> >> >>>>
> >> >>>> It's a data point though of something we'd want to include in such a
> >> >>>> framework, though not sure where it would go on the roadmap...
> >> >>>>
> >> >>>> On Sat, Nov 17, 2018 at 1:15 AM J. Delgado
> >> >>>> <joaquin.delg...@gmail.com> wrote:
> >> >>>>
> >> >>>>> What about the use of word embeddings (see
> >> >>>>> https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
> >> >>>>> to compute word similarity?
> >> >>>>>
> >> >>>>> On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull
> >> >>>>> <dturnb...@opensourceconnections.com> wrote:
> >> >>>>>
> >> >>>>>> Hey folks,
> >> >>>>>>
> >> >>>>>> I wanted to open up a discussion about a change to the usage of
> >> >>>>>> SynonymQuery. The goal here is to have a broader library of
> >> >>>>>> queries that can address other cases where related terms occupy
> >> >>>>>> the same position but don't have the same meaning (such as
> >> >>>>>> hypernyms, hyponyms, meronyms, ambiguous terms, and other query
> >> >>>>>> expansion situations).
> >> >>>>>>
> >> >>>>>> I bring this up because we've noticed (as I'm sure many of you
> >> >>>>>> have) the pattern of clients jamming any related term into a
> >> >>>>>> synonyms file and being surprised with odd results. I like the
> >> >>>>>> idea of enforcing that "synonyms" means exactly-the-same in
> >> >>>>>> Lucene-land. It's an easy thing to tell a client and set up
> >> >>>>>> simple patterns. So for synonyms, I think leaving SynonymQuery in
> >> >>>>>> place works great.
> >> >>>>>>
> >> >>>>>> But I feel if that's the rule, we need to open up discussion of
> >> >>>>>> other methods of scoring conceptual 'related term' relationships
> >> >>>>>> that usually come up in the context of query expansion. This
> >> >>>>>> paper (https://arxiv.org/pdf/1708.00247.pdf), particularly
> >> >>>>>> section 3.2, surveys the current thinking for scoring various
> >> >>>>>> query expansion scenarios like those we deal with in the messy,
> >> >>>>>> ambiguous uses of synonyms in prod systems (khakis aren't
> >> >>>>>> trousers, they're a kind-of trouser).
> >> >>>>>>
> >> >>>>>> The cool thing is many of the ideas in this paper seem doable
> >> >>>>>> with existing Lucene index stats. So one might imagine a 'related
> >> >>>>>> terms' token filter that injected some scoring based on how
> >> >>>>>> related it really is to the original query term using Jaccard,
> >> >>>>>> Dice, or other methods called out in this paper.
> >> >>>>>>
> >> >>>>>> Another insightful set of research is this article on concept
> >> >>>>>> scoring
> >> >>>>>> (https://usabilityetc.com/articles/information-retrieval-concept-matching/),
> >> >>>>>> which prioritizes related terms by connectedness and other
> >> >>>>>> factors.
> >> >>>>>>
> >> >>>>>> Needless to say, it's an open area how two terms someone has
> >> >>>>>> asserted are related to a query term 'should be' scored. It's one
> >> >>>>>> of those things that likely will forever depend on a number of
> >> >>>>>> domain and application specific factors. It's possibly a big
> >> >>>>>> opportunity of improvement for Lucene - but likely is about
> >> >>>>>> putting the right framework in place to allow for a good default
> >> >>>>>> set of query-expansion scoring scenarios with options for
> >> >>>>>> customization.
> >> >>>>>>
> >> >>>>>> What I'm proposing is:
> >> >>>>>>
> >> >>>>>>    - Submit a small patch that restricts SynonymQuery to tokens of
> >> >>>>>>      type "SYNONYM" in the same posn, which allows some short term
> >> >>>>>>      work to be done with the current Lucene QueryBuilder. Any
> >> >>>>>>      additional non-synonym terms would be appended as a boolean
> >> >>>>>>      query for now
> >> >>>>>>    - Begin work on alternate 'related-term' scoring systems that
> >> >>>>>>      also key off the token type in QueryBuilder to create custom
> >> >>>>>>      scoring using built-in term stats. The possibilities here are
> >> >>>>>>      endless, up to weighted related terms (ie Alessandro's patch),
> >> >>>>>>      feeding back Rocchio relevance feedback, etc
> >> >>>>>>
> >> >>>>>> I'm curious what folks would think of a patch for bullet one
> >> >>>>>> followed by other patches down the road for additional
> >> >>>>>> functionality?
> >> >>>>>>
> >> >>>>>> (related to discussion in this Elasticsearch PR
> >> >>>>>> https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249)
> >> >>>>>>
> >> >>>>>> --
> >> >>>>>> CTO, OpenSource Connections
> >> >>>>>> Author, Relevant Search
> >> >>>>>> http://o19s.com/doug
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: dev-h...@lucene.apache.org
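
For reference, the Jaccard and Dice measures mentioned in the original proposal can be computed from three counts: the number of documents containing each term and the number containing both. The sketch below is only an assumption about how such a relatedness score might be derived; `RelatedTermScores` and its methods are hypothetical names, not Lucene API, and how the co-occurrence count is obtained from the index is left to the application.

```java
// Hypothetical helper, not part of Lucene: set-overlap relatedness of two
// terms computed from document counts.
final class RelatedTermScores {

    // Jaccard = |A and B| / |A or B|, where A and B are the sets of
    // documents containing each term.
    static double jaccard(long dfA, long dfB, long dfBoth) {
        long union = dfA + dfB - dfBoth;
        return union == 0 ? 0.0 : (double) dfBoth / union;
    }

    // Dice = 2 * |A and B| / (|A| + |B|).
    static double dice(long dfA, long dfB, long dfBoth) {
        long total = dfA + dfB;
        return total == 0 ? 0.0 : 2.0 * dfBoth / total;
    }
}
```

With 100 documents containing the first term, 50 containing the second, and 25 containing both, Jaccard is 25 / 125 = 0.2 and Dice is 50 / 150, roughly 0.33. Either value could then be used as a weight for the related term in a QueryBuilder subclass, which keeps the scoring decision out of the analysis chain.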