aglinxinyuan opened a new issue, #5654: URL: https://github.com/apache/texera/issues/5654
## Background `CaseSensitiveAnalyzer` in `common/workflow-operator` (`operator.keywordSearch`) currently lacks a unit-spec. It is a Lucene `Analyzer` subclass used by the keyword-search operator when the user opts into case-sensitive matching — the abstraction skips the lowercasing pipeline used by `StandardAnalyzer`, so a regression here would silently downgrade case-sensitive search. ## Behavior to pin | Surface | Contract | | --- | --- | | `tokenStream` of `"Hello World"` | produces the tokens `"Hello"` and `"World"` (case preserved) | | Mixed-case input (`"FooBar BazQux"`) | preserves case in every emitted token | | Whitespace tokenization | splits on whitespace; tabs and newlines also split | | Punctuation handling | matches `WhitespaceTokenizer` semantics — punctuation is part of adjacent tokens (e.g. `"abc,def"` becomes one token, not two) | | Empty input | produces no tokens | | Pure-whitespace input | produces no tokens | | `StopFilter` with empty stop-word set | retains every token (no filtering applied) | | Two distinct field names | each `tokenStream(fieldName, ...)` returns its own independent `TokenStream` (no shared state across fields) | ## Scope - New spec file: `CaseSensitiveAnalyzerSpec.scala` (matches the `<srcClassName>Spec.scala` convention). - No production-code changes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
