Hi all,

I've implemented a proof of concept and submitted it as PR #14349:
https://github.com/apache/lucene/pull/14349

The implementation follows Michael's suggestion and uses an automaton
rather than term expansion. Instead of expanding each term into all of its
possible case variations, it builds a single case-insensitive automaton
that matches all of the provided terms.
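
To illustrate the difference between the two strategies, here is a minimal
standalone sketch in plain Java. It is not the code in the PR and it does
not use Lucene's automaton classes. The class CaseInsensitiveTermSet and
its methods are hypothetical names made up for this illustration, and the
per-code-point folding in fold() is an assumption that may differ from the
folding rules Lucene applies.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not the PR's code and not a Lucene API. */
final class CaseInsensitiveTermSet {

    /** One automaton state: transitions keyed by case-folded code point. */
    private static final class State {
        final Map<Integer, State> next = new HashMap<>();
        boolean accept;
    }

    private final State root = new State();
    private int stateCount = 1; // counts the root state

    /** Builds one trie-shaped automaton over all terms, sharing folded prefixes. */
    CaseInsensitiveTermSet(List<String> terms) {
        for (String term : terms) {
            State s = root;
            for (int i = 0; i < term.length(); ) {
                int cp = term.codePointAt(i);
                i += Character.charCount(cp);
                int key = fold(cp);
                State child = s.next.get(key);
                if (child == null) {
                    child = new State();
                    s.next.put(key, child);
                    stateCount++;
                }
                s = child;
            }
            s.accept = true;
        }
    }

    /** Simple case folding (assumption: one code point maps to one code point). */
    static int fold(int cp) {
        return Character.toLowerCase(Character.toUpperCase(cp));
    }

    /** Runs the candidate through the automaton, folding each code point. */
    boolean matches(String candidate) {
        State s = root;
        for (int i = 0; i < candidate.length(); ) {
            int cp = candidate.codePointAt(i);
            i += Character.charCount(cp);
            s = s.next.get(fold(cp));
            if (s == null) {
                return false;
            }
        }
        return s.accept;
    }

    int stateCount() {
        return stateCount;
    }

    /** Number of case variants that naive term expansion would enumerate. */
    static long expansionCount(String term) {
        long variants = 1;
        for (int i = 0; i < term.length(); ) {
            int cp = term.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.toLowerCase(cp) != Character.toUpperCase(cp)) {
                variants *= 2; // a cased character doubles the variant count
            }
        }
        return variants;
    }
}
```

In this sketch the number of states grows with the total length of the
terms, and terms that share a case-insensitive prefix share states, whereas
expansionCount doubles for every cased character in a term.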

I welcome your feedback on this approach, particularly regarding:

   1. Performance characteristics with large term sets
   2. The chosen implementation approach vs. alternatives
   3. Any edge cases I may have missed in the test coverage


Looking forward to your thoughts!

Best,
Will

On Wed, Mar 12, 2025 at 12:43 PM Robert Muir <rcm...@gmail.com> wrote:

> It wouldn't be reasonably efficient unless you are able to make a
> case-insensitive version of Automata.makeStringUnion.
>
> For this case, just add a lowercased field to the index. It is a
> search index not a database...
>
> On Wed, Mar 12, 2025 at 12:53 PM Michael Froh <msf...@gmail.com> wrote:
> >
> > I think this is an interesting idea.
> >
> > I'm not sure if a normalizer would be sufficient, because I think it
> would require that the indexed terms are already normalized.
> >
> > Given that TermInSetQuery already implements MultiTermQuery, it's
> already in the family of queries that matches terms using an automaton
> (though TISQuery uses PrefixCodedTerms to generate a TermIterator, which
> gets wrapped in a FilteredTermsEnum to intersect with indexed terms). I
> wonder if it would make sense to just build an automaton for this case? The
> good news is that you could probably prototype that with a case-insensitive
> RegexpQuery over the union of terms and see if the resulting automaton
> gobbles up all the memory and/or takes forever to run. I think the number
> of nodes in the automaton would probably be roughly 2x the total number of
> characters across all terms. Maybe it would be possible to find common
> (case-insensitive) prefixes to produce a more compact automaton? (That
> might be where normalizing the query terms would help.)
> >
> > Thanks!
> > Froh
> >
> >
> > On Wed, Mar 12, 2025 at 8:07 AM Will Dickerson <
> will.e.dicker...@gmail.com> wrote:
> >>
> >> Hi all,
> >>
> >> I’d like to start a discussion about adding case-insensitive matching
> support to TermInSetQuery. Currently, Elasticsearch’s terms query, which
> maps to TermInSetQuery in Lucene, does not support case insensitivity.
> This limitation has led to user requests for case-insensitive matching in
> Elasticsearch (e.g., this issue).
> >>
> >> Problem Statement
> >>
> >> Unlike Elasticsearch’s term query, which supports case_insensitive: true,
> TermInSetQuery does not, meaning users must preprocess their data at index
> time.
> >> This affects use cases like email lookups, usernames, and
> case-insensitive identifiers, where exact case preservation is required but
> searches must remain case insensitive.
> >>
> >> Proposed Solution
> >>
> >> Extend TermInSetQuery to optionally apply a normalizer (e.g.,
> LowerCaseFilter) before executing lookups.
> >> Alternatively, introduce a new query type (e.g.,
> CaseInsensitiveTermsQuery) to handle this efficiently.
> >>
> >> Considerations
> >>
> >> The previous discussion in Elasticsearch mentioned concerns about query
> expansion if case normalization required rewriting into a BooleanQuery.
> >> A possible mitigation is applying normalization only once per term
> before execution.
> >>
> >> Would the team be open to discussing this further? If this approach
> makes sense, I’d be happy to explore implementation details and submit a
> proof of concept.
> >>
> >> Thanks,
> >> Will Dickerson