It wouldn't be reasonably efficient unless you are able to make a case-insensitive version of Automata.makeStringUnion.
For this case, just add a lowercased field to the index. It is a search index not a database...

On Wed, Mar 12, 2025 at 12:53 PM Michael Froh <msf...@gmail.com> wrote:
>
> I think this is an interesting idea.
>
> I'm not sure if a normalizer would be sufficient, because I think it would
> require that the indexed terms are already normalized.
>
> Given that TermInSetQuery already implements MultiTermQuery, it's already in
> the family of queries that matches terms using an automaton (though TISQuery
> uses PrefixCodedTerms to generate a TermIterator, which gets wrapped in a
> FilteredTermsEnum to intersect with indexed terms). I wonder if it would make
> sense to just build an automaton for this case? The good news is that you
> could probably prototype that with a case-insensitive RegexpQuery over the
> union of terms and see if the resulting automaton gobbles up all the memory
> and/or takes forever to run. I think the number of nodes in the automaton
> would probably be roughly 2x the total number of characters across all terms.
> Maybe it would be possible to find common (case-insensitive) prefixes to
> produce a more compact automaton? (That might be where normalizing the query
> terms would help.)
>
> Thanks!
> Froh
>
>
> On Wed, Mar 12, 2025 at 8:07 AM Will Dickerson <will.e.dicker...@gmail.com>
> wrote:
>>
>> Hi all,
>>
>> I’d like to start a discussion about adding case-insensitive matching
>> support to TermsInSetQuery. Currently, Elasticsearch’s terms query, which
>> maps to TermsInSetQuery in Lucene, does not support case insensitivity. This
>> limitation has led to user requests for case-insensitive matching in
>> Elasticsearch (e.g., this issue).
>>
>> Problem Statement
>>
>> Unlike TermQuery, which supports case_insensitive: true, TermsInSetQuery
>> does not, meaning users must preprocess their data at index time.
>> This affects use cases like email lookups, usernames, and case-insensitive
>> identifiers, where exact case preservation is required but searches must
>> remain case insensitive.
>>
>> Proposed Solution
>>
>> Extend TermsInSetQuery to optionally apply a normalizer (e.g.,
>> LowercaseFilter) before executing lookups.
>> Alternatively, introduce a new query type (e.g., CaseInsensitiveTermsQuery)
>> to handle this efficiently.
>>
>> Considerations
>>
>> The previous discussion in Elasticsearch mentioned concerns about query
>> expansion if case normalization required rewriting into a BooleanQuery.
>> A possible mitigation is applying normalization only once per term before
>> execution.
>>
>> Would the team be open to discussing this further? If this approach makes
>> sense, I’d be happy to explore implementation details and submit a proof of
>> concept.
>>
>> Thanks,
>> Will Dickerson
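
For illustration only, the lowercased-field suggestion at the top of this reply could look roughly like the sketch below. It is plain Java with a toy in-memory map standing in for the index, not Lucene code: the class name LowercasedFieldSketch, the "idLower" map and the method names are all made up for the example. In a real index you would presumably write the value into a second field through a lowercasing normalizer and run the ordinary TermInSetQuery against that field, with each query term lowercased once.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy stand-in for an inverted index; hypothetical names, not Lucene API. */
public class LowercasedFieldSketch {
    // Stored values keep the original case (what gets returned to the user).
    private final List<String> stored = new ArrayList<>();
    // Hypothetical lowercased field: lowercased term -> matching doc ids.
    private final Map<String, Set<Integer>> idLower = new HashMap<>();

    /** Indexes one value in both forms and returns its doc id. */
    public int add(String value) {
        int docId = stored.size();
        stored.add(value);
        idLower.computeIfAbsent(normalize(value), k -> new TreeSet<>()).add(docId);
        return docId;
    }

    /** Lowercases each query term once, then does exact lookups. */
    public Set<Integer> matchAny(List<String> queryTerms) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : queryTerms) {
            hits.addAll(idLower.getOrDefault(normalize(term), Set.of()));
        }
        return hits;
    }

    /** Returns the original-case value for a doc id. */
    public String stored(int docId) {
        return stored.get(docId);
    }

    // Locale.ROOT keeps lowercasing independent of the JVM's default locale.
    private static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        LowercasedFieldSketch idx = new LowercasedFieldSketch();
        idx.add("Alice@Example.com");
        idx.add("BOB@example.com");
        for (int docId : idx.matchAny(List.of("alice@EXAMPLE.com", "bob@example.COM"))) {
            System.out.println(docId + " -> " + idx.stored(docId));
        }
    }
}
```

The point of the sketch is that the query side stays a plain exact-match set lookup: no automaton and no per-term expansion, at the cost of one extra field in the index.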