Yes, that is another good area (there are many), although of course embeddings have their own challenges and complexities: they often capture shared context, but not shared meaning.
It's a data point though of something we'd want to include in such a framework, though not sure where it would go on the roadmap...

On Sat, Nov 17, 2018 at 1:15 AM J. Delgado <joaquin.delg...@gmail.com> wrote:

> What about the use of word embeddings (see
> https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
> to compute word similarity?
>
> On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <
> dturnb...@opensourceconnections.com> wrote:
>
>> Hey folks,
>>
>> I wanted to open up a discussion about a change to the usage of
>> SynonymQuery. The goal here is to have a broader library of queries that
>> can address other cases where related terms occupy the same position but
>> don't have the same meaning (such as hypernyms, hyponyms, meronyms,
>> ambiguous terms, and other query expansion situations).
>>
>> I bring this up because we've noticed (as I'm sure many of you have) the
>> pattern of clients jamming any related term into a synonyms file and being
>> surprised with odd results. I like the idea of enforcing "synonyms" means
>> exactly-the-same in Lucene-land. It's an easy thing to tell a client and
>> setup simple patterns. So for synonyms, I think leaving SynonymQuery in
>> place works great.
>>
>> But I feel if that's the rule, we need to open up discussion of other
>> methods of scoring conceptual 'related term' relationships that usually
>> comes up in the context of query expansion. This paper (
>> https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2, surveys
>> the current thinking for scoring various query expansion scenarios like
>> those we deal with in the messy, ambiguous uses of synonyms in prod systems
>> (khakis aren't trousers, they're a kind-of trouser).
>>
>> The cool thing is many of the ideas in this paper seem doable with
>> existing Lucene index stats. So one might imagine a 'related terms' token
>> filter that injected some scoring based on how related it really is to
>> the original query term using Jaccard, Dice, or other methods called out in
>> this paper.
>>
>> Another insightful set of research is this article on concept scoring (
>> https://usabilityetc.com/articles/information-retrieval-concept-matching/
>> ), which prioritizes related terms by connectedness and other factors.
>>
>> Needless to say, it's an open area how two terms someone has asserted are
>> related to a query term 'should be' scored. It's one of those things that
>> likely will forever depend on a number of domain and application specific
>> factors. It's possibly a big opportunity of improvement for Lucene - but
>> likely is about putting the right framework in place to allow for good
>> default set of query-expansion scoring scenarios with options for
>> customization.
>>
>> What I'm proposing is:
>>
>> - Submit a small patch that restricts SynonymQuery to tokens of type
>>   "SYNONYM" in the same posn, which allows some short term work to be done
>>   with the current Lucene QueryBuilder. Any additional non-synonym terms
>>   would be appended as a boolean query for now
>> - Begin work on alternate 'related-term' scoring systems that also key
>>   off the token type in QueryBuilder to create custom scoring using built-in
>>   term stats. The possibilities here are endless, up to weighted related
>>   terms (ie Alessandro's patch), feeding back Rocchio relevance feedback,
>>   etc
>>
>> I'm curious what folks would think of a patch for bullet one followed by
>> other patches down the road for additional functionality?
>>
>> (related to discussion in this Elasticsearch PR
>> https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249
>> )
>>
>> --
>> CTO, OpenSource Connections
>> Author, Relevant Search
>> http://o19s.com/doug
>>
>
--
CTO, OpenSource Connections
Author, Relevant Search
http://o19s.com/doug
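
P.S. To make the Jaccard/Dice idea from the quoted proposal a bit more concrete, here is a rough sketch in Python. It is not Lucene code and not taken from the paper. It assumes that for the original query term and each related term we can get the document frequency of each, plus the number of documents containing both (the co-occurrence count would probably need an extra query, which is an assumption on my part). All function names and the counts in the usage example are hypothetical.

```python
from typing import Callable, Dict, Tuple


def jaccard(df_a: int, df_b: int, df_both: int) -> float:
    """Jaccard coefficient |A and B| / |A or B| over the documents containing each term."""
    union = df_a + df_b - df_both
    return df_both / union if union > 0 else 0.0


def dice(df_a: int, df_b: int, df_both: int) -> float:
    """Dice coefficient 2 * |A and B| / (|A| + |B|) over the same document sets."""
    total = df_a + df_b
    return 2.0 * df_both / total if total > 0 else 0.0


def weight_related_terms(
    original_df: int,
    candidates: Dict[str, Tuple[int, int]],
    measure: Callable[[int, int, int], float] = jaccard,
) -> Dict[str, float]:
    """Score each related term against the original query term.

    candidates maps a related term to (its document frequency,
    number of documents that contain both it and the original term).
    """
    return {
        term: measure(original_df, df, df_both)
        for term, (df, df_both) in candidates.items()
    }


if __name__ == "__main__":
    # Made-up counts: "trousers" is the original term, appearing in 1000 docs.
    weights = weight_related_terms(
        1000,
        {"khakis": (200, 150), "jeans": (400, 40)},
        measure=dice,
    )
    for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
        print(f"{term}: {weight:.3f}")
```

The resulting weights could then act as boosts on the non-synonym related terms, so a term that rarely co-occurs with the original contributes less than a true synonym would. Whether a co-occurrence measure is the right default is exactly the open question above.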