Re: Proposal: Adding Case-Insensitive Support to TermsInSetQuery

Robert Muir Wed, 12 Mar 2025 10:43:10 -0700

It wouldn't be reasonably efficient unless you are able to make a
case-insensitive version of Automata.makeStringUnion.


For this case, just add a lowercased field to the index. It is a
search index, not a database...
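
As a rough illustration of the lowercased-field approach, here is a minimal
sketch in plain Java with no Lucene dependency. The class and method names
(LowercasedFieldSketch, add, matchAnyIgnoreCase, matchAnyExact) are
hypothetical, and the two maps are only a toy stand-in for two indexed fields.
In a real index the second map would presumably be a separate field holding
the lowercased value, queried with an ordinary TermInSetQuery over lowercased
query terms.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy in-memory stand-in for an index with an exact field and a lowercased copy. */
final class LowercasedFieldSketch {
    // term -> ids of the documents that contain it, one map per "field"
    private final Map<String, SortedSet<Integer>> exact = new HashMap<>();
    private final Map<String, SortedSet<Integer>> lowercased = new HashMap<>();

    /** Index time: keep the value as-is and also store a lowercased copy. */
    void add(int docId, String value) {
        exact.computeIfAbsent(value, k -> new TreeSet<>()).add(docId);
        lowercased
            .computeIfAbsent(value.toLowerCase(Locale.ROOT), k -> new TreeSet<>())
            .add(docId);
    }

    /** Query time: lowercase each query term once, then do plain exact lookups. */
    SortedSet<Integer> matchAnyIgnoreCase(Collection<String> queryTerms) {
        SortedSet<Integer> hits = new TreeSet<>();
        for (String term : queryTerms) {
            String key = term.toLowerCase(Locale.ROOT);
            hits.addAll(lowercased.getOrDefault(key, Collections.emptySortedSet()));
        }
        return hits;
    }

    /** Case-sensitive lookup against the original values, for comparison. */
    SortedSet<Integer> matchAnyExact(Collection<String> queryTerms) {
        SortedSet<Integer> hits = new TreeSet<>();
        for (String term : queryTerms) {
            hits.addAll(exact.getOrDefault(term, Collections.emptySortedSet()));
        }
        return hits;
    }
}
```

The point of the sketch is that the case folding happens once per document at
index time and once per query term at query time, so the lookup itself stays
an exact set membership test and no automaton or query expansion is needed.
Locale.ROOT is used so the folding does not depend on the JVM's default
locale.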

On Wed, Mar 12, 2025 at 12:53 PM Michael Froh <msf...@gmail.com> wrote:
>
> I think this is an interesting idea.
>
> I'm not sure if a normalizer would be sufficient, because I think it would 
> require that the indexed terms are already normalized.
>
> Given that TermInSetQuery already extends MultiTermQuery, it's already in 
> the family of queries that matches terms using an automaton (though TISQuery 
> uses PrefixCodedTerms to generate a TermIterator, which gets wrapped in a 
> FilteredTermsEnum to intersect with indexed terms). I wonder if it would make 
> sense to just build an automaton for this case? The good news is that you 
> could probably prototype that with a case-insensitive RegexpQuery over the 
> union of terms and see if the resulting automaton gobbles up all the memory 
> and/or takes forever to run. I think the number of nodes in the automaton 
> would probably be roughly 2x the total number of characters across all terms. 
> Maybe it would be possible to find common (case-insensitive) prefixes to 
> produce a more compact automaton? (That might be where normalizing the query 
> terms would help.)
>
> Thanks!
> Froh
>
>
> On Wed, Mar 12, 2025 at 8:07 AM Will Dickerson <will.e.dicker...@gmail.com> 
> wrote:
>>
>> Hi all,
>>
>> I’d like to start a discussion about adding case-insensitive matching
>> support to TermInSetQuery. Currently, Elasticsearch’s terms query, which
>> maps to TermInSetQuery in Lucene, does not support case insensitivity. This
>> limitation has led to user requests for case-insensitive matching in
>> Elasticsearch (e.g., this issue).
>>
>> Problem Statement
>>
>> Unlike Elasticsearch’s term query, which supports case_insensitive: true, the
>> terms query (TermInSetQuery) does not, so users must preprocess their data at
>> index time.
>> This affects use cases like email lookups, usernames, and case-insensitive 
>> identifiers, where exact case preservation is required but searches must 
>> remain case insensitive.
>>
>> Proposed Solution
>>
>> Extend TermInSetQuery to optionally apply a normalizer (e.g.,
>> LowerCaseFilter) before executing lookups.
>> Alternatively, introduce a new query type (e.g., CaseInsensitiveTermsQuery) 
>> to handle this efficiently.
>>
>> Considerations
>>
>> The previous discussion in Elasticsearch mentioned concerns about query 
>> expansion if case normalization required rewriting into a BooleanQuery.
>> A possible mitigation is applying normalization only once per term before 
>> execution.
>>
>> Would the team be open to discussing this further? If this approach makes 
>> sense, I’d be happy to explore implementation details and submit a proof of 
>> concept.
>>
>> Thanks,
>> Will Dickerson

