Yes, that is another good area (there are many), although of course
embeddings have their own challenges and complexities: they often capture
shared context, but not shared meaning.

It's a data point for something we'd want to include in such a
framework, though I'm not sure where it would go on the roadmap...

On Sat, Nov 17, 2018 at 1:15 AM J. Delgado <joaquin.delg...@gmail.com>
wrote:

> What about the use of word embeddings (see
> https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
> to compute word similarity?
>
> On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <
> dturnb...@opensourceconnections.com> wrote:
>
>> Hey folks,
>>
>> I wanted to open up a discussion about a change to the usage of
>> SynonymQuery. The goal here is to have a broader library of queries that
>> can address other cases where related terms occupy the same position but
>> don't have the same meaning (such as hypernyms, hyponyms, meronyms,
>> ambiguous terms, and other query expansion situations).
>>
>>
>> I bring this up because we've noticed (as I'm sure many of you have) the
>> pattern of clients jamming any related term into a synonyms file and being
>> surprised by odd results. I like the idea of enforcing that "synonyms" means
>> exactly-the-same in Lucene-land. It's an easy thing to tell a client and to
>> set up simple patterns around. So for synonyms, I think leaving SynonymQuery
>> in place works great.
>>
>> But I feel if that's the rule, we need to open up discussion of other
>> methods of scoring conceptual 'related term' relationships that usually
>> come up in the context of query expansion. This paper (
>> https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2, surveys
>> the current thinking for scoring various query expansion scenarios like
>> those we deal with in the messy, ambiguous uses of synonyms in prod systems
>> (khakis aren't trousers, they're a kind-of trouser).
>>
>>
>> The cool thing is many of the ideas in this paper seem doable with
>> existing Lucene index stats. So one might imagine a 'related terms' token
>> filter that injected some scoring based on how related each term really is
>> to the original query term, using Jaccard, Dice, or other methods called
>> out in this paper.
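
A minimal sketch of the Jaccard and Dice measures mentioned above, computed
from document-frequency style statistics. This is plain Python rather than
Lucene code: the POSTINGS data, the relatedness() helper, and the choice to
use per-document co-occurrence counts are all assumptions made for
illustration, not anything taken from the paper or from Lucene's API.

```python
# Toy postings lists: term -> set of ids of documents containing the term.
# The data is made up purely for illustration.
POSTINGS = {
    "khakis": {1, 2, 3, 4},
    "trousers": {2, 3, 4, 5, 6, 7},
    "shirts": {4, 8, 9},
}


def jaccard(df_a, df_b, df_both):
    """Jaccard coefficient: |A and B| / |A or B|, from document counts."""
    union = df_a + df_b - df_both
    return df_both / union if union > 0 else 0.0


def dice(df_a, df_b, df_both):
    """Dice coefficient: 2 * |A and B| / (|A| + |B|), from document counts."""
    total = df_a + df_b
    return 2.0 * df_both / total if total > 0 else 0.0


def relatedness(term_a, term_b, measure=jaccard):
    """Score how related two terms are using only index-level counts.

    Hypothetical helper: a real implementation would read document
    frequencies and co-occurrence counts from the index instead.
    """
    docs_a = POSTINGS.get(term_a, set())
    docs_b = POSTINGS.get(term_b, set())
    return measure(len(docs_a), len(docs_b), len(docs_a & docs_b))


if __name__ == "__main__":
    for candidate in ("trousers", "shirts"):
        print(
            candidate,
            relatedness("khakis", candidate, jaccard),
            relatedness("khakis", candidate, dice),
        )
```

In this toy data "trousers" co-occurs with "khakis" far more often than
"shirts" does, so it would receive the larger weight under either measure.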
>>
>>
>> Another insightful piece of research is this article on concept scoring (
>> https://usabilityetc.com/articles/information-retrieval-concept-matching/
>> ), which prioritizes related terms by connectedness and other factors.
>>
>> Needless to say, how terms that someone has asserted are related to a
>> query term 'should be' scored is an open question. It's one of those things
>> that will likely forever depend on a number of domain- and
>> application-specific factors. It's possibly a big opportunity for
>> improvement in Lucene, but it's likely about putting the right framework in
>> place to allow for a good default set of query-expansion scoring scenarios
>> with options for customization.
>>
>> What I'm proposing is:
>>
>>
>>    - Submit a small patch that restricts SynonymQuery to tokens of type
>>      "SYNONYM" in the same position, which allows some short-term work to
>>      be done with the current Lucene QueryBuilder. Any additional
>>      non-synonym terms would be appended as a boolean query for now.
>>
>>    - Begin work on alternate 'related-term' scoring systems that also key
>>      off the token type in QueryBuilder to create custom scoring using
>>      built-in term stats. The possibilities here are endless, up to
>>      weighted related terms (i.e. Alessandro's patch), feeding back Rocchio
>>      relevance feedback, etc.
>>
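
A rough sketch of the first bullet's grouping logic, written in plain Python
rather than against Lucene's QueryBuilder. The Token tuple, the build_query()
function, the "HYPONYM" token type, and the tuple-based query representation
are all hypothetical; treating the first token at each position as the
original query term is also an assumption.

```python
from collections import namedtuple

# Hypothetical stand-in for an analyzed token: its text, its position in the
# token stream, and its token type.
Token = namedtuple("Token", ["term", "position", "type"])


def build_query(tokens):
    """Group same-position tokens, keeping only SYNONYM-typed ones together.

    Returns one clause per position:
      ("term", t)          a single term
      ("synonym", [...])   the original term blended with its true synonyms
      ("should", [...])    the main clause plus other related terms, each
                           kept as its own optional clause
    """
    by_position = {}
    for tok in tokens:
        by_position.setdefault(tok.position, []).append(tok)

    clauses = []
    for position in sorted(by_position):
        group = by_position[position]
        original = group[0]  # assumption: the original term comes first
        synonyms = [original.term]
        synonyms += [t.term for t in group[1:] if t.type == "SYNONYM"]
        related = [t.term for t in group[1:] if t.type != "SYNONYM"]

        if len(synonyms) > 1:
            main = ("synonym", synonyms)
        else:
            main = ("term", original.term)

        if related:
            # Non-synonym terms are appended as separate boolean clauses.
            clauses.append(("should", [main] + [("term", r) for r in related]))
        else:
            clauses.append(main)
    return clauses


if __name__ == "__main__":
    stream = [
        Token("pants", 0, "word"),
        Token("trousers", 0, "SYNONYM"),
        Token("khakis", 0, "HYPONYM"),  # hypothetical non-synonym type
        Token("blue", 1, "word"),
    ]
    for clause in build_query(stream):
        print(clause)
```

The second bullet would then amount to replacing the plain ("term", r)
clauses with something scored according to the token type.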
>>
>> I'm curious what folks would think of a patch for bullet one followed by
>> other patches down the road for additional functionality?
>>
>> (related to discussion in this Elasticsearch PR:
>> https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249)
>>
>> --
>> CTO, OpenSource Connections
>> Author, Relevant Search
>> http://o19s.com/doug
>>
> --
CTO, OpenSource Connections
Author, Relevant Search
http://o19s.com/doug
