raghavyadav01 opened a new pull request, #16276:
URL: https://github.com/apache/pinot/pull/16276
## Summary
This PR adds support for case-insensitive regex matching in Lucene FST (Finite
State Transducer) indexes while maintaining backward compatibility with
existing case-sensitive FST implementations.
## FST Behavior: Input/Output Types
### Existing FST Implementation
- **Input Type**: BYTE4 (UTF-16 encoded strings)
- **Output Type**: LONG (single dictionary ID)
- **Mapping**: One key → One value (e.g., "Hello" → 1)
### Problem with Case-Insensitive Requirements
For case-insensitive matching, every case variation of a value must resolve
through the same normalized key. For example, these three distinct dictionary
entries all normalize to "hello":
- "Hello" → 1
- "HELLO" → 2
- "hello" → 3
**Challenge**: The existing LONG output type can store only a single value
per key, but the normalized key must carry all of these dictionary IDs.
## Solution: Use BytesRef to Store Multiple Values
### New Case-Insensitive FST Implementation
- **Input Type**: BYTE4 (UTF-16 encoded strings, normalized to lowercase)
- **Output Type**: BytesRef (serialized list of dictionary IDs)
- **Mapping**: One normalized key → Multiple values (e.g., "hello" → [1, 2, 3])
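
The mapping above can be sketched in plain Java. This is a minimal illustration and not the code in this PR: the class `CaseInsensitiveFstSketch`, its method names, and the byte layout (a 4-byte count followed by 4-byte IDs) are assumptions made for the example. It groups dictionary IDs under their lowercased key and packs each group into a byte array of the kind that could be wrapped in a `BytesRef` output.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of the Pinot codebase.
final class CaseInsensitiveFstSketch {

  // Groups dictionary IDs under the lowercased form of each dictionary value.
  // A TreeMap keeps the normalized keys sorted.
  static TreeMap<String, List<Integer>> groupByNormalizedKey(Map<String, Integer> dictionary) {
    TreeMap<String, List<Integer>> grouped = new TreeMap<>();
    for (Map.Entry<String, Integer> entry : dictionary.entrySet()) {
      String normalized = entry.getKey().toLowerCase(Locale.ROOT);
      grouped.computeIfAbsent(normalized, k -> new ArrayList<>()).add(entry.getValue());
    }
    // Sort each ID list so the result does not depend on map iteration order.
    grouped.values().forEach(Collections::sort);
    return grouped;
  }

  // Assumed layout: 4-byte count, then one 4-byte big-endian int per dictionary ID.
  static byte[] serialize(List<Integer> dictIds) {
    ByteBuffer buffer = ByteBuffer.allocate(Integer.BYTES * (dictIds.size() + 1));
    buffer.putInt(dictIds.size());
    for (int id : dictIds) {
      buffer.putInt(id);
    }
    return buffer.array();
  }

  // Reverses serialize(): reads the count, then that many IDs.
  static List<Integer> deserialize(byte[] bytes) {
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    int count = buffer.getInt();
    List<Integer> ids = new ArrayList<>(count);
    for (int i = 0; i < count; i++) {
      ids.add(buffer.getInt());
    }
    return ids;
  }
}
```

With the three entries from the problem statement, `groupByNormalizedKey` yields a single key `"hello"` holding `[1, 2, 3]`, and `serialize` turns that list into 16 bytes under the assumed layout.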
### Magic Header Detection
To distinguish case-sensitive from case-insensitive FSTs automatically, the
reader checks the first 4 bytes of the FST:
- If magic header = "FSTI" → Read as BytesRef (case-insensitive)
- If magic header = "\fsa" → Read as Long (case-sensitive, backward compatibility)
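
A rough sketch of this check, using only the JDK, might look as follows. The class `FstMagicSketch` and its `detect` method are hypothetical names, `"\fsa"` is assumed to mean a literal backslash followed by `fsa`, and rejecting unrecognized headers is a choice of this sketch, not something this description specifies.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of the Pinot codebase.
final class FstMagicSketch {

  enum FstKind { CASE_INSENSITIVE, CASE_SENSITIVE }

  // "FSTI" marks the new case-insensitive format (BytesRef outputs).
  private static final byte[] CASE_INSENSITIVE_MAGIC = "FSTI".getBytes(StandardCharsets.US_ASCII);

  // Assumption: the existing header is a literal backslash followed by "fsa".
  private static final byte[] CASE_SENSITIVE_MAGIC = {'\\', 'f', 's', 'a'};

  // Inspects the first 4 bytes and reports which reader should be used.
  static FstKind detect(byte[] data) {
    if (data.length < 4) {
      throw new IllegalArgumentException("Need at least 4 bytes to read the magic header");
    }
    byte[] header = Arrays.copyOf(data, 4);
    if (Arrays.equals(header, CASE_INSENSITIVE_MAGIC)) {
      return FstKind.CASE_INSENSITIVE;
    }
    if (Arrays.equals(header, CASE_SENSITIVE_MAGIC)) {
      return FstKind.CASE_SENSITIVE;
    }
    // Sketch-only behavior: unknown headers are rejected.
    throw new IllegalArgumentException("Unrecognized FST magic header");
  }
}
```

Because the decision depends only on bytes already present in the segment, no per-segment flag or table-config lookup is needed at read time.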
## Backward Compatibility
**Existing segments continue to work as-is with no changes required.**
- All existing case-sensitive FST segments remain fully functional
- No migration needed for current deployments
- New case-insensitive FSTs are automatically detected via magic header
## Sample Configuration
```json
{
  "tableName": "user_logs",
  "fieldConfigList": [
    {
      "name": "domain_name",
      "encodingType": "DICTIONARY",
      "indexes": {
        "fst": {
          "type": "LUCENE",
          "caseSensitive": false
        }
      }
    }
  ]
}
```
## Sample Query
```sql
SELECT domain_name, COUNT(*)
FROM user_logs
WHERE REGEXP_LIKE(domain_name, 'WWW.EXAMPLE.*')
GROUP BY domain_name
```
**Response:**
```json
{
  "resultTable": {
    "dataSchema": {
      "columnNames": ["domain_name", "count(*)"],
      "columnDataTypes": ["STRING", "LONG"]
    },
    "rows": [
      ["www.example.com", 100],
      ["WWW.EXAMPLE.ORG", 50],
      ["www.Example.net", 75]
    ]
  }
}
```
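
The matching semantics expected from the query above can be reproduced outside Pinot with `java.util.regex`. This is only a sanity check of the intended behavior, not how the FST index evaluates the pattern. The class `CaseInsensitiveRegexSketch` is a hypothetical name, and the use of a full match (`matches()`) is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the Pinot codebase.
final class CaseInsensitiveRegexSketch {

  // Returns the values that fully match the regex when letter case is ignored.
  static List<String> filter(String regex, List<String> values) {
    Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    List<String> matched = new ArrayList<>();
    for (String value : values) {
      if (pattern.matcher(value).matches()) {
        matched.add(value);
      }
    }
    return matched;
  }
}
```

Calling `filter("WWW.EXAMPLE.*", ...)` on the three domain names in the response keeps all of them, while a value such as `api.example.com` (made up for this example) is dropped.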
## Testing
- Added case-insensitive FST tests (`FSTBasedCaseInsensitiveRegexpLikeQueriesTest.java`)
- Enhanced existing case-sensitive tests to ensure backward compatibility
- Added FST builder tests for both modes
- Added configuration serialization/deserialization tests
## Breaking Changes
None. This change is fully backward compatible.