I think this is an interesting idea.

I'm not sure a normalizer would be sufficient on its own, because I think
it would only work if the indexed terms were already normalized too.
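
To make that concern concrete, here is a toy sketch in plain Python (not
Lucene). The `indexed_terms` set and the helper name are made up for
illustration, and exact-match set lookup stands in for a terms-dictionary
lookup:

```python
# Hypothetical toy "index": the terms exactly as they were written at index time.
indexed_terms = {"Alice@Example.com", "bob@example.com"}


def lookup_with_query_side_normalizer(query_terms, indexed):
    """Lowercase only the query terms, then do exact-match lookups."""
    normalized = {term.lower() for term in query_terms}
    return sorted(normalized & indexed)


hits = lookup_with_query_side_normalizer(
    ["ALICE@example.com", "Bob@Example.com"], indexed_terms
)
print(hits)
```

Only the term that happened to be indexed in lower case is found; the
mixed-case indexed term is missed even though the query asked for it.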

Given that TermInSetQuery already extends MultiTermQuery, it's already
in the family of queries that match terms using an automaton (though
TermInSetQuery uses PrefixCodedTerms to generate a TermIterator, which gets
wrapped in a FilteredTermsEnum to intersect with indexed terms). I wonder
if it would make sense to just build an automaton for this case? The good
news is that you could probably prototype that with a case-insensitive
RegexpQuery over the union of terms and see if the resulting automaton
gobbles up all the memory and/or takes forever to run. My rough guess is
that the number of nodes in the automaton would be about 2x the total
number of characters across all terms. Maybe it would be possible to find
common (case-insensitive) prefixes to produce a more compact automaton?
(That might be where normalizing the query terms would help.)
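
As a rough illustration of the prefix-sharing idea, here is a toy sketch in
plain Python. It builds a simple character trie, not a Lucene Automaton, and
the sample terms are made up. It assumes that in a real case-insensitive
automaton each edge of the folded trie would accept both the upper-case and
lower-case form of its character:

```python
def count_trie_nodes(terms):
    """Count the nodes (root included) of a character trie built from terms."""
    root = {}
    nodes = 1  # the root
    for term in terms:
        node = root
        for ch in term:
            if ch not in node:
                node[ch] = {}
                nodes += 1
            node = node[ch]
    return nodes


# Hypothetical query terms that differ mostly by case.
terms = ["Foo", "FOOD", "foot", "Bar", "BAZ"]

total_chars = sum(len(t) for t in terms)
as_is = count_trie_nodes(terms)
folded = count_trie_nodes([t.lower() for t in terms])

print(total_chars, as_is, folded)
```

With the terms left as-is, prefixes that differ only by case are not shared,
so the trie is nearly as large as the total character count. Lowercasing the
query terms first lets those prefixes collapse into shared paths.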

Thanks!
Froh


On Wed, Mar 12, 2025 at 8:07 AM Will Dickerson <will.e.dicker...@gmail.com>
wrote:

> Hi all,
>
> I’d like to start a discussion about adding case-insensitive matching
> support to TermsInSetQuery. Currently, Elasticsearch’s terms query, which
> maps to TermsInSetQuery in Lucene, does not support case insensitivity.
> This limitation has led to user requests for case-insensitive matching in
> Elasticsearch (e.g., this issue
> <https://github.com/elastic/elasticsearch/issues/71520>).
> Problem Statement
>
>    - Unlike TermQuery, which supports case_insensitive: true,
>    TermsInSetQuery does not, meaning users must preprocess their data at
>    index time.
>    - This affects use cases like email lookups, usernames, and
>    case-insensitive identifiers, where exact case preservation is required but
>    searches must remain case insensitive.
>
> Proposed Solution
>
>    - Extend TermsInSetQuery to optionally apply a normalizer (e.g.,
>    LowercaseFilter) before executing lookups.
>    - Alternatively, introduce a new query type (e.g.,
>    CaseInsensitiveTermsQuery) to handle this efficiently.
>
> Considerations
>
>    - The previous discussion in Elasticsearch mentioned concerns about
>    query expansion if case normalization required rewriting into a
>    BooleanQuery.
>    - A possible mitigation is applying normalization only once per term
>    before execution.
>
> Would the team be open to discussing this further? If this approach makes
> sense, I’d be happy to explore implementation details and submit a proof of
> concept.
>
> Thanks,
> Will Dickerson
>
