jasperjiaguo opened a new issue, #9666:
URL: https://github.com/apache/pinot/issues/9666
In `LuceneTextIndexCreator` we currently hardcode the stop words for the Lucene
text index:
```
Arrays.asList("a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
    "if", "in", "into", "is", "it", "no",
    "not", "of", "on", "or", "such", "that", "the", "their", "then",
    "than", "there", "these", "they", "this",
    "to", "was", "will", "with", "those"),
```
These words are pruned during text index generation as well as from the
filter at query time (in `StandardAnalyzer`). The problem is that in production
we found users issuing queries like
`SELECT ... FROM ignoreMe WHERE TEXT_MATCH(title, '"IT staff" OR "IT manager"')`
which actually return the results matching
`TEXT_MATCH(title, '"staff" OR "manager"')`. This can be easily reproduced in
`TextSearchQueriesTest`.
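To illustrate the pruning described above, here is a minimal, self-contained sketch. It does not call Lucene; `StopWordPruningSketch` and `analyze` are hypothetical names, and the tokenization (split on non-alphanumerics, lowercase, drop stop words) is a simplified stand-in for what `StandardAnalyzer` does, assuming the same stop word list as above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not Pinot or Lucene code.
final class StopWordPruningSketch {
  // Same words as the hardcoded list quoted in this issue.
  static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
      "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
      "if", "in", "into", "is", "it", "no",
      "not", "of", "on", "or", "such", "that", "the", "their", "then",
      "than", "there", "these", "they", "this",
      "to", "was", "will", "with", "those"));

  // Simplified analyzer: split on non-alphanumerics, lowercase, drop stop words.
  static List<String> analyze(String text) {
    List<String> tokens = new ArrayList<>();
    for (String raw : text.split("[^\\p{Alnum}]+")) {
      if (raw.isEmpty()) {
        continue;
      }
      String token = raw.toLowerCase(Locale.ROOT);
      if (!STOP_WORDS.contains(token)) {
        tokens.add(token);
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    // "IT" is lowercased to "it", which is a stop word, so it is dropped.
    System.out.println(analyze("IT staff"));   // prints [staff]
    System.out.println(analyze("IT manager")); // prints [manager]
  }
}
```

Because the same pruning applies to both the indexed text and the query phrase, `"IT staff"` and `"staff"` end up as the same token sequence in this sketch.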
Meanwhile, there is a TODO item for making `LUCENE_INDEX_MAX_BUFFER_SIZE_MB`
configurable. These two changes can be evaluated/made together.
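One possible shape for a configurable stop word list, as a rough sketch only: start from the current defaults, remove the words the user excludes, then add any extra words. The class name, method names, and the comma-separated `excludeCsv` / `includeCsv` parameters are all assumptions made for illustration, not an existing Pinot config.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; names and config format are assumptions.
final class ConfigurableStopWordsSketch {
  // Current hardcoded defaults, copied from this issue.
  static final List<String> DEFAULT_STOP_WORDS = Arrays.asList(
      "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
      "if", "in", "into", "is", "it", "no",
      "not", "of", "on", "or", "such", "that", "the", "their", "then",
      "than", "there", "these", "they", "this",
      "to", "was", "will", "with", "those");

  // Defaults minus excluded words, plus included words.
  static Set<String> resolveStopWords(String excludeCsv, String includeCsv) {
    Set<String> result = new LinkedHashSet<>(DEFAULT_STOP_WORDS);
    result.removeAll(parseCsv(excludeCsv));
    result.addAll(parseCsv(includeCsv));
    return result;
  }

  // Parses a comma-separated list; null or blank input yields an empty list.
  static List<String> parseCsv(String csv) {
    List<String> words = new ArrayList<>();
    if (csv == null) {
      return words;
    }
    for (String part : csv.split(",")) {
      String word = part.trim().toLowerCase(Locale.ROOT);
      if (!word.isEmpty()) {
        words.add(word);
      }
    }
    return words;
  }
}
```

With something like `resolveStopWords("it", null)`, the word "it" would no longer be pruned, so a phrase query such as `"IT staff"` could be told apart from `"staff"`. The same set would need to be used at both index and query time for results to stay consistent.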
cc @Jackie-Jiang @walterddr @siddharthteotia @SabrinaZhaozyf