aglinxinyuan opened a new issue, #5654:
URL: https://github.com/apache/texera/issues/5654

   ## Background
   
   `CaseSensitiveAnalyzer` in `common/workflow-operator` 
(`operator.keywordSearch`) currently lacks a unit-spec. It is a Lucene 
`Analyzer` subclass used by the keyword-search operator when the user opts into 
case-sensitive matching — the abstraction skips the lowercasing pipeline used 
by `StandardAnalyzer`, so a regression here would silently downgrade 
case-sensitive search.
   
   ## Behavior to pin
   
   | Surface | Contract |
   | --- | --- |
   | `tokenStream` of `"Hello World"` | produces the tokens `"Hello"` and 
`"World"` (case preserved) |
   | Mixed-case input (`"FooBar BazQux"`) | preserves case in every emitted 
token |
   | Whitespace tokenization | splits on whitespace; tabs and newlines also 
split |
   | Punctuation handling | matches `WhitespaceTokenizer` semantics — 
punctuation is part of adjacent tokens (e.g. `"abc,def"` becomes one token, not 
two) |
   | Empty input | produces no tokens |
   | Pure-whitespace input | produces no tokens |
   | `StopFilter` with empty stop-word set | retains every token (no filtering 
applied) |
   | Two distinct field names | each `tokenStream(fieldName, ...)` returns its 
own independent `TokenStream` (no shared state across fields) |
   
   ## Scope
   
   - New spec file: `CaseSensitiveAnalyzerSpec.scala` (matches the 
`<srcClassName>Spec.scala` convention).
   - No production-code changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to