Hi all, I've implemented a proof of concept and submitted it as PR #14349: https://github.com/apache/lucene/pull/14349

The implementation takes an approach based on Michael's suggestion,
using an automaton-based solution rather than term expansion. The
implementation avoids expanding terms into all possible case variations
by creating a single case-insensitive automaton that matches all
provided terms.

I welcome your feedback on this approach, particularly regarding:

1. Performance characteristics with large term sets
2. The chosen implementation approach vs. alternatives
3. Any edge cases I may have missed in the test coverage

Looking forward to your thoughts!

Best,
Will

On Wed, Mar 12, 2025 at 12:43 PM Robert Muir <rcm...@gmail.com> wrote:

> It wouldn't be reasonably efficient unless you are able to make a
> case-insensitive version of Automata.makeStringUnion.
>
> For this case, just add a lowercased field to the index. It is a
> search index not a database...
>
> On Wed, Mar 12, 2025 at 12:53 PM Michael Froh <msf...@gmail.com> wrote:
> >
> > I think this is an interesting idea.
> >
> > I'm not sure if a normalizer would be sufficient, because I think it
> > would require that the indexed terms are already normalized.
> >
> > Given that TermInSetQuery already implements MultiTermQuery, it's
> > already in the family of queries that matches terms using an
> > automaton (though TISQuery uses PrefixCodedTerms to generate a
> > TermIterator, which gets wrapped in a FilteredTermsEnum to intersect
> > with indexed terms). I wonder if it would make sense to just build an
> > automaton for this case? The good news is that you could probably
> > prototype that with a case-insensitive RegexpQuery over the union of
> > terms and see if the resulting automaton gobbles up all the memory
> > and/or takes forever to run. I think the number of nodes in the
> > automaton would probably be roughly 2x the total number of characters
> > across all terms. Maybe it would be possible to find common
> > (case-insensitive) prefixes to produce a more compact automaton?
> > (That might be where normalizing the query terms would help.)
> >
> > Thanks!
> > Froh
> >
> >
> > On Wed, Mar 12, 2025 at 8:07 AM Will Dickerson <will.e.dicker...@gmail.com> wrote:
> >>
> >> Hi all,
> >>
> >> I’d like to start a discussion about adding case-insensitive
> >> matching support to TermInSetQuery. Currently, Elasticsearch’s terms
> >> query, which maps to TermInSetQuery in Lucene, does not support case
> >> insensitivity. This limitation has led to user requests for
> >> case-insensitive matching in Elasticsearch (e.g., this issue).
> >>
> >> Problem Statement
> >>
> >> Unlike TermQuery, which supports case_insensitive: true,
> >> TermInSetQuery does not, meaning users must preprocess their data at
> >> index time.
> >> This affects use cases like email lookups, usernames, and
> >> case-insensitive identifiers, where exact case preservation is
> >> required but searches must remain case insensitive.
> >>
> >> Proposed Solution
> >>
> >> Extend TermInSetQuery to optionally apply a normalizer (e.g.,
> >> LowerCaseFilter) before executing lookups.
> >> Alternatively, introduce a new query type (e.g.,
> >> CaseInsensitiveTermsQuery) to handle this efficiently.
> >>
> >> Considerations
> >>
> >> The previous discussion in Elasticsearch mentioned concerns about
> >> query expansion if case normalization required rewriting into a
> >> BooleanQuery.
> >> A possible mitigation is applying normalization only once per term
> >> before execution.
> >>
> >> Would the team be open to discussing this further? If this approach
> >> makes sense, I’d be happy to explore implementation details and
> >> submit a proof of concept.
> >>
> >> Thanks,
> >> Will Dickerson
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
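
For anyone skimming the thread, here is a rough, self-contained sketch of the general idea discussed above: build one structure over case-folded characters so that terms sharing a case-insensitive prefix share states, instead of expanding every term into all of its case variants. This is not the code in the PR and it does not use Lucene's automaton classes. The class name CaseInsensitiveTermSet and its methods are hypothetical, and the per-code-point folding is an assumption that ignores locale-specific and multi-character case mappings. It also only checks one candidate string at a time, whereas a real query would intersect the automaton with the indexed terms.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: a trie used as a deterministic automaton over case-folded code points. */
final class CaseInsensitiveTermSet {

  /** One automaton state; transitions are keyed by the folded code point. */
  private static final class Node {
    final Map<Integer, Node> next = new HashMap<>();
    boolean accept; // true if some term ends in this state
  }

  private final Node root = new Node();
  private int nodeCount = 1; // counts the root

  private CaseInsensitiveTermSet() {}

  /** Builds the structure from the given terms (hypothetical factory method). */
  static CaseInsensitiveTermSet of(List<String> terms) {
    CaseInsensitiveTermSet set = new CaseInsensitiveTermSet();
    for (String term : terms) {
      set.add(term);
    }
    return set;
  }

  /** Assumption: simple per-code-point folding, no locale or multi-character mappings. */
  private static int fold(int codePoint) {
    return Character.toLowerCase(Character.toUpperCase(codePoint));
  }

  private void add(String term) {
    Node node = root;
    for (int i = 0; i < term.length(); ) {
      int cp = term.codePointAt(i);
      i += Character.charCount(cp);
      int key = fold(cp);
      Node child = node.next.get(key);
      if (child == null) {
        // New state only when no earlier term shares this case-insensitive prefix.
        child = new Node();
        node.next.put(key, child);
        nodeCount++;
      }
      node = child;
    }
    node.accept = true;
  }

  /** Returns true if the candidate equals one of the terms, ignoring case. */
  boolean matches(String candidate) {
    Node node = root;
    for (int i = 0; i < candidate.length(); ) {
      int cp = candidate.codePointAt(i);
      i += Character.charCount(cp);
      node = node.next.get(fold(cp));
      if (node == null) {
        return false;
      }
    }
    return node.accept;
  }

  /** Number of states, useful for eyeballing how much prefix sharing helps. */
  int nodeCount() {
    return nodeCount;
  }

  public static void main(String[] args) {
    CaseInsensitiveTermSet set = of(List.of("Foo", "foobar", "BAZ"));
    System.out.println(set.matches("FOO"));
    System.out.println(set.matches("fooBAR"));
    System.out.println(set.matches("fo"));
    System.out.println(set.nodeCount());
  }
}
```

Because "Foo" and "foobar" fold to the same prefix, they share states, so the state count grows with the number of distinct folded prefixes rather than with the number of case variants. Whether this carries over to large term sets in Lucene is exactly the performance question raised in the thread.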